In this exercise (much less guided than the previous ones), your goal is to build a cyber-intrusion detector.
To do so, you will use a well-known (but modified) dataset called the KDD-CUP-99 Dataset. You must build a network intrusion detector: a machine learning model capable of distinguishing between "malicious" connections (attacks) and "good" connections (normal, ordinary, legitimate ones).
The dataset contains almost 20,000 connections (one per row/observation), collected over a period of time and audited by cybersecurity experts, who have labeled each connection as attack or normal depending on whether it is malicious or not. Since this is a very routine and tedious task, it would be ideal to use the expert-labeled data to train a machine learning model that can automatically predict whether future connections are attacks or not.
Each row of the CSV (KDD_dataset.csv, semicolon-separated) therefore contains the information of one connection. Besides indicating whether it is an attack or not (the variable to predict, type, the last column), each connection has many additional columns of information. You can (and should!) use these columns, with the goal of predicting as well as possible whether the connection is an attack based on those features.
The dataset thus contains 46 columns: 45 connection attributes, plus the variable to predict, type. The following table (which I built from the official dataset documentation) describes what each column means:
| column (or family of columns) | description |
|---|---|
| duration | length (number of seconds) of the connection |
| protocol_type | type of the protocol, e.g. tcp, udp, etc. |
| service | network service on the destination, e.g., http, telnet, etc. |
| src_bytes | number of data bytes from source to destination |
| dst_bytes | number of data bytes from destination to source |
| flag | normal or error status of the connection |
| land | 1 if connection is from/to the same host/port; 0 otherwise |
| wrong_fragment | number of "wrong" fragments |
| urgent | number of urgent packets |
| hot | number of "hot" indicators |
| num_failed_logins | number of failed login attempts |
| logged_in | 1 if successfully logged in; 0 otherwise |
| num_compromised | number of "compromised" conditions |
| root_shell | 1 if root shell is obtained; 0 otherwise |
| su_attempted | 1 if "su root" command attempted; 0 otherwise |
| num_root | number of "root" accesses |
| num_file_creations | number of file creation operations |
| num_shells | number of shell prompts |
| num_access_files | number of operations on access control files |
| num_outbound_cmds | number of outbound commands in an ftp session |
| is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise |
| is_guest_login | 1 if the login is a "guest" login; 0 otherwise |
| count | number of connections to the same host as the current connection in the past two seconds |
| srv_count | number of connections to the same service as the current connection in the past two seconds |
| serror_rate | % of same-host connections that have "SYN" errors |
| srv_serror_rate | % of same-service connections that have "SYN" errors |
| rerror_rate | % of same-host connections that have "REJ" errors |
| srv_rerror_rate | % of same-service connections that have "REJ" errors |
| same_srv_rate | % of connections to the same service |
| diff_srv_rate | % of connections to different services |
| srv_diff_host_rate | % of connections to different hosts |
If you look at the table above, you will see it does not have 45 entries, but somewhat fewer. This is because, from some of those entries, we have generated for you more than one feature that your models can use. For example: in the table above there is an entry protocol_type, which, according to its description, is the network protocol used by the connection: tcp, udp, and so on. We have already One-Hot-Encoded this kind of variable for you; so instead of finding a column protocol_type, you will find protocol_type__tcp, protocol_type__udp and protocol_type__icmp, each of which has the value 1 (if the connection is indeed of that type) or 0 (if it is not).
For example, for the first two connections of the dataset we find:
| protocol_type__tcp | protocol_type__udp | protocol_type__icmp |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 0 |
This means that the first connection is a udp one, and the second a tcp one. In this way, what would originally have been a single string-typed feature (protocol_type, taking string values such as tcp, udp or icmp) is now several features, but numeric ones, which we can feed directly into machine learning models. You can read more about One-Hot-Encoding in the extended version of the machine learning slides.
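As an illustration, this kind of encoding can be reproduced with pandas' get_dummies; a toy sketch with made-up rows (the real dataset already comes encoded, so you do not need to do this yourself):

```python
import pandas as pd

# Toy column with the same string values mentioned above (not the real dataset).
df = pd.DataFrame({"protocol_type": ["udp", "tcp", "icmp", "tcp"]})

# prefix_sep="__" mimics the double-underscore naming used in the dataset.
encoded = pd.get_dummies(df, columns=["protocol_type"], prefix_sep="__", dtype=int)
print(encoded)
```

Each category becomes its own 0/1 column, named alphabetically: protocol_type__icmp, protocol_type__tcp, protocol_type__udp.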
Needless to say, you do not actually need to know anything about cybersecurity to do this, nor do you need to understand exactly what each feature means; assuming the features are well collected, one of the nice things about machine learning is that you can act on data without extensive knowledge of the field you are applying it to.
Once your model is built, generate predictions for the new connections (which come without the type column, since for now we do not know whether they are intrusions or not), available in the file nuevas_conexiones.csv, and save them back to a CSV file named predicciones_nuevas_conexiones.csv that includes all the features of those observations plus a new column called prediccion_ml containing the predictions. And that's it. Good luck!
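The end-to-end shape of that deliverable might look like the sketch below. It uses tiny made-up stand-in DataFrames instead of the real CSVs, and scikit-learn's DecisionTreeClassifier is just one possible model choice, not the one prescribed by the exercise:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for pd.read_csv("KDD_dataset.csv", sep=";") -- made-up rows.
train = pd.DataFrame({
    "src_bytes": [0, 10, 500, 3, 800, 5],
    "count":     [1, 2, 300, 1, 250, 2],
    "type":      ["normal", "normal", "attack", "normal", "attack", "normal"],
})
# Stand-in for pd.read_csv("nuevas_conexiones.csv", sep=";") -- no type column.
nuevas = pd.DataFrame({"src_bytes": [700, 1], "count": [280, 1]})

features = [c for c in train.columns if c != "type"]
modelo = DecisionTreeClassifier(random_state=0)
modelo.fit(train[features], train["type"])

# Keep all the features and append the predictions as prediccion_ml.
predicciones = nuevas.copy()
predicciones["prediccion_ml"] = modelo.predict(nuevas[features])
predicciones.to_csv("predicciones_nuevas_conexiones.csv", sep=";", index=False)
```

The key point is the output format: the new observations' features plus one extra prediccion_ml column, written semicolon-separated like the input files.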
Magic command so that plots are embedded inside this notebook instead of opening in a separate window:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
A couple of options are configured:
pd.set_option("display.max_rows", 500) # How many DataFrame rows pandas displays in the Notebook
plt.style.use("ggplot")
The dataset is read from the working directory:
conexiones_dataset = pd.read_csv("KDD_dataset.csv", sep=";") # The 19534 connections are semicolon-separated
A first look at the dataset:
conexiones_dataset
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 274 | 275 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.01 | normal |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 7 | 17 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.24 | normal |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 29 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.10 | normal |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | normal |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 12 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.08 | 0.67 | 0.00 | normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19529 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | normal |
| 19530 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | attack |
| 19531 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 188 | 187 | 0.0 | 0.0 | 0.0 | 0.0 | 0.99 | 0.01 | 0.00 | normal |
| 19532 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | normal |
| 19533 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 43 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.05 | normal |
19534 rows × 46 columns
There are 19534 rows (connections), each with 45 associated columns or variables (its attributes) and their corresponding data. The variable to predict is the last column, number 46: type.
This is a Supervised Machine Learning problem and, within that, a Classification one. The goal is to build a predictive model for the explicit target variable type, which is a category or class. It can take the values attack or normal, the former characterizing malicious connections and the latter good ones. We are therefore facing not a multiclass classification problem, but a binary one.
Relatively little is known about this dataset, which could be considered "new" to the user, so it seems convenient and appropriate to carry out a descriptive statistical analysis of the variables. We start with the .describe() method:
conexiones_dataset.describe()
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | is_guest_login | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | ... | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 | 19534.000000 |
| mean | 180.821030 | 0.797174 | 0.177741 | 0.025084 | 0.545664 | 0.046534 | 0.121480 | 0.091737 | 0.064298 | 0.013361 | ... | 0.010904 | 32.808181 | 29.523446 | 0.046867 | 0.045581 | 0.051509 | 0.051962 | 0.928727 | 0.032733 | 0.122112 |
| std | 1549.405372 | 0.402114 | 0.382305 | 0.156386 | 0.497923 | 0.210644 | 0.326693 | 0.288662 | 0.245290 | 0.114819 | ... | 0.103854 | 75.561680 | 69.005003 | 0.201608 | 0.199683 | 0.217922 | 0.218368 | 0.234630 | 0.149619 | 0.266532 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 5.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 16.000000 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.100000 |
| max | 42616.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 511.000000 | 511.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 45 columns
Scanning the statistics and their values across the variables, at first sight no figure particularly stands out or raises the alarm of an irregular or singular value or situation deserving deeper analysis.
The isnull() and sum() methods are used to find out whether the dataframe contains null records per column:
conexiones_dataset.isnull().sum()
duration              0
protocol_type__tcp    0
protocol_type__udp    0
protocol_type__icmp   0
service__http         0
service__private      0
service__domain_u     0
service__smtp         0
service__ftp_data     0
service__telnet       0
service__ftp          0
service__other        0
src_bytes             0
dst_bytes             0
flag__SF              0
flag__S0              0
flag__REJ             0
flag__RSTR            0
flag__RSTO            0
flag__OTH             0
land                  0
wrong_fragment        0
urgent                0
hot                   0
num_failed_logins     0
logged_in             0
num_compromised       0
root_shell            0
su_attempted          0
num_root              0
num_file_creations    0
num_shells            0
num_access_files      0
num_outbound_cmds     0
is_host_login         0
is_guest_login        0
count                 0
srv_count             0
serror_rate           0
srv_serror_rate       0
rerror_rate           0
srv_rerror_rate       0
same_srv_rate         0
diff_srv_rate         0
srv_diff_host_rate    0
type                  0
dtype: int64
conexiones_dataset.isnull().sum().sum()
0
No null values are present in the dataframe.
It is interesting to see what proportion of malicious and good connections there is:
conexiones_dataset['type'].value_counts()
normal    18282
attack     1252
Name: type, dtype: int64
# Relative frequency table of malicious and good connections:
100 * conexiones_dataset['type'].value_counts() / len(conexiones_dataset['type'])
normal    93.590662
attack     6.409338
Name: type, dtype: float64
# Bar chart of good and malicious connections:
plot = conexiones_dataset['type'].value_counts().plot(kind = 'bar',
title = 'Good and malicious connections',
color = ["blue", "red"],
rot = 0)
plt.xlabel("Connection type")
plt.ylabel("Number of connections")
Text(0, 0.5, 'Number of connections')
Most connections are good, with only a minority of malicious ones. In the context of classification problems, a dataset is said to be imbalanced when one of the classes (the majority class, here the good connections) is noticeably more represented than the rest (the minority classes, here the malicious connections). This is an important fact to keep in mind during the rest of the analysis and when building the best predictive model.
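One standard precaution with this kind of imbalance is to stratify any train/test split so both parts keep the same attack proportion. A hedged sketch with synthetic labels mirroring the dataset's roughly 94/6 split (the stratify parameter belongs to scikit-learn's train_test_split):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the real dataset.
y = pd.Series(["normal"] * 94 + ["attack"] * 6)
X = pd.DataFrame({"dummy_feature": range(100)})

# stratify=y forces both halves to keep the 94/6 class proportion,
# so the rare attack class does not vanish from either split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
print((y_train == "attack").sum(), (y_test == "attack").sum())
```

Without stratification, a random split could easily end up with very few attacks in one of the halves.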
A list with the names of the dataframe's feature columns, features, is obtained for possible use from here on:
conexiones_dataset_columnas_features = conexiones_dataset.loc[:, conexiones_dataset.columns != 'type']
columns_names_features = conexiones_dataset_columnas_features.columns.values
features = list(columns_names_features)
features
['duration', 'protocol_type__tcp', 'protocol_type__udp', 'protocol_type__icmp', 'service__http', 'service__private', 'service__domain_u', 'service__smtp', 'service__ftp_data', 'service__telnet', 'service__ftp', 'service__other', 'src_bytes', 'dst_bytes', 'flag__SF', 'flag__S0', 'flag__REJ', 'flag__RSTR', 'flag__RSTO', 'flag__OTH', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate']
len(features) # Check: there are 45 features (variable number 46 is the target to predict, i.e., type)
45
We will try to be systematic in the deeper analysis developed from this point on. How? By splitting the features into three categories, so as to choose the most suitable visualizations for each. Namely: the "number of" features, those taking values 0 or 1 (and 2! — 'su_attempted'), and the percentages expressed as fractions of one.
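One way to make that split operational (my own grouping logic, shown here on an abbreviated hand-written subset of the feature names, not the full features list from above) is to lean on the columns' naming conventions:

```python
# Abbreviated, hand-picked subset of the 45 feature names.
features = ["duration", "src_bytes", "land", "logged_in",
            "serror_rate", "same_srv_rate", "is_guest_login"]

# Percentages in [0, 1] all end in "_rate".
rate_features = [f for f in features if f.endswith("_rate")]
# 0/1 indicator columns (su_attempted, not shown here, can also take the value 2).
binary_features = [f for f in features if f in
                   {"land", "logged_in", "root_shell", "su_attempted",
                    "is_host_login", "is_guest_login"}]
# Everything else is a "number of" style count.
count_features = [f for f in features
                  if f not in rate_features and f not in binary_features]
print(rate_features, binary_features, count_features)
```

Running the same logic on the full features list would partition all 45 columns into the three families analyzed below.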
1.- "Number of"
1.1.- 'duration'
conexiones_dataset["duration"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Connection duration in seconds", fontsize = 25)
plt.xlabel("Duration", fontsize = 20)
plt.ylabel("Time in s", fontsize = 20)
Text(0, 0.5, 'Time in s')
joker = conexiones_dataset[["duration", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, x = "type", y = "duration", figsize = (20, 10), color = "blue")
plt.title("Duration in seconds vs Good connections", fontsize = 25)
plt.xlabel("Duration in seconds", fontsize = 20)
plt.ylabel("Good connections", fontsize = 20)
Text(0, 0.5, 'Good connections')
joker_attack.plot(kind = "hist", bins = 500, x = "type", y = "duration", figsize = (20, 10), color = "red")
plt.title("Duration in seconds vs Malicious connections", fontsize = 25)
plt.xlabel("Duration in seconds", fontsize = 20)
plt.ylabel("Malicious connections", fontsize = 20)
Text(0, 0.5, 'Malicious connections')
conexiones_dataset["duration"].describe()
count    19534.000000
mean       180.821030
std       1549.405372
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      42616.000000
Name: duration, dtype: float64
The describe method already suggested that the bulk of the connection durations sits around zero seconds, so it is natural that both good and malicious connections concentrate around that value. Two things perhaps stand out. On the one hand, the large number of outliers in duration, up to a maximum of 42616 seconds (almost 12 hours), shown by the box plot; on the other, the histograms show that at high duration values malicious connections appear in greater numbers than good ones.
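The per-class comparison drawn above with two separate histograms can also be read off numerically with a groupby; a toy sketch with made-up durations (not the real dataset):

```python
import pandas as pd

# Made-up durations: mostly zeros, plus a couple of long connections.
df = pd.DataFrame({
    "duration": [0, 0, 0, 5, 40000, 0, 3000],
    "type": ["normal"] * 5 + ["attack"] * 2,
})

# One row of summary statistics per class, instead of two separate plots.
per_class = df.groupby("type")["duration"].describe()
print(per_class)
```

Applied to conexiones_dataset, this would give the count/mean/quartiles/max of duration for attack and normal side by side.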
1.2.- 'src_bytes'
conexiones_dataset["src_bytes"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Number of data bytes from source to destination", fontsize = 25)
plt.xlabel("Bytes", fontsize = 20)
plt.ylabel("Number of bytes", fontsize = 20)
Text(0, 0.5, 'Number of bytes')
joker = conexiones_dataset[["src_bytes", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, x = "type", y = "src_bytes", figsize = (20, 10), color = "blue")
plt.title("Number of data bytes from source to destination vs Good connections", fontsize = 25)
plt.xlabel("Number of bytes", fontsize = 20)
plt.ylabel("Good connections", fontsize = 20)
Text(0, 0.5, 'Good connections')
joker_attack.plot(kind = "hist", bins = 500, x = "type", y = "src_bytes", figsize = (20, 10), color = "red")
plt.title("Number of data bytes from source to destination vs Malicious connections", fontsize = 25)
plt.xlabel("Number of bytes", fontsize = 20)
plt.ylabel("Malicious connections", fontsize = 20)
Text(0, 0.5, 'Malicious connections')
conexiones_dataset["src_bytes"].describe()
count    1.953400e+04
mean     1.037943e+04
std      2.151006e+05
min      0.000000e+00
25%      4.800000e+01
50%      2.260000e+02
75%      3.160000e+02
max      2.194552e+07
Name: src_bytes, dtype: float64
The describe method already suggested that the bulk of the source-to-destination byte counts lies between 48 and 316 bytes (the first and third quartiles), so it is natural that both good and malicious connections concentrate around those values. What stands out is an outlier with a maximum value of 21,945,520 bytes, and that connection is malicious.
1.3.- 'dst_bytes'
conexiones_dataset["dst_bytes"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Number of data bytes from destination to source", fontsize = 25)
plt.xlabel("Bytes", fontsize = 20)
plt.ylabel("Number of bytes", fontsize = 20)
Text(0, 0.5, 'Number of bytes')
joker = conexiones_dataset[["dst_bytes", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, x = "type", y = "dst_bytes", figsize = (20, 10), color = "blue")
plt.title("Number of data bytes from destination to source vs Good connections", fontsize = 25)
plt.xlabel("Number of bytes", fontsize = 20)
plt.ylabel("Good connections", fontsize = 20)
Text(0, 0.5, 'Good connections')
joker_attack.plot(kind = "hist", bins = 500, x = "type", y = "dst_bytes", figsize = (20, 10), color = "red")
plt.title("Number of data bytes from destination to source vs Malicious connections", fontsize = 25)
plt.xlabel("Number of bytes", fontsize = 20)
plt.ylabel("Malicious connections", fontsize = 20)
Text(0, 0.5, 'Malicious connections')
conexiones_dataset["dst_bytes"].describe()
count    1.953400e+04
mean     4.036082e+03
std      5.838140e+04
min      0.000000e+00
25%      5.025000e+01
50%      3.535000e+02
75%      1.980000e+03
max      5.150938e+06
Name: dst_bytes, dtype: float64
The describe method already suggested that the bulk of the destination-to-source byte counts lies between roughly 50 and 1980 bytes (the first and third quartiles), so it is natural that both good and malicious connections concentrate around those values. What stands out is an outlier with a maximum value of 5,150,938 bytes which, like the source-to-destination outlier, also corresponds to a malicious connection.
1.4.- 'wrong_fragment'
conexiones_dataset["wrong_fragment"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Number of wrong fragments", fontsize = 25)
plt.xlabel("Wrong fragments", fontsize = 20)
plt.ylabel("Number", fontsize = 20)
Text(0, 0.5, 'Number')
joker = conexiones_dataset[["wrong_fragment", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, x = "type", y = "wrong_fragment", figsize = (20, 10), color = "blue")
plt.title("Number of wrong fragments vs Good connections", fontsize = 25)
plt.xlabel("Number of wrong fragments", fontsize = 20)
plt.ylabel("Good connections", fontsize = 20)
Text(0, 0.5, 'Good connections')
joker_attack.plot(kind = "hist", bins = 500, x = "type", y = "wrong_fragment", figsize = (20, 10), color = "red")
plt.title("Number of wrong fragments vs Malicious connections", fontsize = 25)
plt.xlabel("Number of wrong fragments", fontsize = 20)
plt.ylabel("Malicious connections", fontsize = 20)
Text(0, 0.5, 'Malicious connections')
conexiones_dataset["wrong_fragment"].describe()
count    19534.000000
mean         0.003993
std          0.103610
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          3.000000
Name: wrong_fragment, dtype: float64
The describe method already suggested that the number of wrong fragments is zero for the bulk of the connections, so it is natural that both good and malicious connections concentrate around that value. What stands out is the presence of outliers with values 1 and 3, among which malicious connections appear in a much higher proportion than in the dataset as a whole.
wrong_fragment_mask = conexiones_dataset["wrong_fragment"] == 1
wrong_fragment_1 = conexiones_dataset[wrong_fragment_mask]
wrong_fragment_1
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 320 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.67 | attack |
| 474 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 12 | 12 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.00 | normal |
| 3381 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.50 | normal |
| 4425 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.67 | attack |
| 4492 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.50 | normal |
| 8465 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 12 | 12 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.00 | attack |
| 8973 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.67 | normal |
| 9335 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.50 | attack |
| 11443 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.50 | normal |
| 13703 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.50 | normal |
| 14797 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 8 | 8 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.00 | attack |
| 17510 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.00 | normal |
12 rows × 46 columns
wrong_fragment_mask = conexiones_dataset["wrong_fragment"] == 3
wrong_fragment_3 = conexiones_dataset[wrong_fragment_mask]
wrong_fragment_3
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1344 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 19 | 19 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 3079 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 67 | 67 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 4229 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 28 | 28 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 5409 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 27 | 27 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | attack |
| 5513 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 92 | 92 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 6616 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 91 | 91 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 6991 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 44 | 44 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 7093 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 74 | 72 | 0.01 | 0.0 | 0.01 | 0.0 | 0.97 | 0.04 | 0.0 | attack |
| 7661 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 61 | 61 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 7838 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 93 | 93 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 8639 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 9 | 9 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | attack |
| 9013 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 70 | 70 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 9370 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 74 | 74 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | attack |
| 10057 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 8 | 8 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 13282 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 26 | 26 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 13292 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 162 | 62 | 0.62 | 0.0 | 0.00 | 0.0 | 0.38 | 0.04 | 0.0 | normal |
| 13924 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 63 | 63 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 14147 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 66 | 66 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 14935 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 60 | 60 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 16545 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 30 | 30 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | attack |
| 17941 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 48 | 48 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 18786 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 82 | 82 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
22 rows × 46 columns
Up to this point, a pattern seems to hold: among outlier values, malicious connections appear in a much higher proportion than the 6.4% observed in the dataset as a whole.
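That comparison against the base rate can be made explicit in code; a toy sketch with made-up rows (the real check would use conexiones_dataset and the 6.4% figure computed earlier):

```python
import pandas as pd

# Made-up data: 16 connections without wrong fragments, 4 outliers with them.
df = pd.DataFrame({
    "wrong_fragment": [0] * 16 + [1, 1, 3, 3],
    "type": ["normal"] * 15 + ["attack"]
            + ["attack", "normal", "attack", "normal"],
})

# Attack share overall vs attack share inside the outlier subset.
base_rate = (df["type"] == "attack").mean()
outlier_rate = (df.loc[df["wrong_fragment"] > 0, "type"] == "attack").mean()
print(base_rate, outlier_rate)
```

If outlier_rate is well above base_rate, the outliers of that variable are disproportionately malicious, which is exactly the pattern observed in the two tables above.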
1.5.- 'urgent'
conexiones_dataset["urgent"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Number of urgent packets", fontsize = 25)
plt.xlabel("Urgent packets", fontsize = 20)
plt.ylabel("Number", fontsize = 20)
Text(0, 0.5, 'Number')
joker = conexiones_dataset[["urgent", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, x = "type", y = "urgent", figsize = (20, 10), color = "blue")
plt.title("Number of urgent packets vs Good connections", fontsize = 25)
plt.xlabel("Number of urgent packets", fontsize = 20)
plt.ylabel("Good connections", fontsize = 20)
Text(0, 0.5, 'Good connections')
joker_attack.plot(kind = "hist", bins = 500, x = "type", y = "urgent", figsize = (20, 10), color = "red")
plt.title("Number of urgent packets vs Malicious connections", fontsize = 25)
plt.xlabel("Number of urgent packets", fontsize = 20)
plt.ylabel("Malicious connections", fontsize = 20)
Text(0, 0.5, 'Malicious connections')
conexiones_dataset["urgent"].describe()
count    19534.000000
mean         0.000205
std          0.020237
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          2.000000
Name: urgent, dtype: float64
The describe method already suggested that the number of urgent packets is zero for the bulk of the connections, so it is natural that both good and malicious connections concentrate around that value. What stands out is the presence of two outliers with value 2, both corresponding to good connections.
urgent_mask = conexiones_dataset["urgent"] == 2
urgent_filtrado_2 = conexiones_dataset[urgent_mask]
urgent_filtrado_2
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1305 | 2514 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | normal |
| 18419 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | normal |
2 rows × 46 columns
1.6.- 'hot'
conexiones_dataset["hot"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Number of hot indicators", fontsize = 25)
plt.xlabel("Hot indicators", fontsize = 20)
plt.ylabel("Number", fontsize = 20)
Text(0, 0.5, 'Number')
joker = conexiones_dataset[["hot", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, x = "type", y = "hot", figsize = (20, 10), color = "blue")
plt.title("Number of hot indicators vs Good connections", fontsize = 25)
plt.xlabel("Number of hot indicators", fontsize = 20)
plt.ylabel("Good connections", fontsize = 20)
Text(0, 0.5, 'Good connections')
joker_attack.plot(kind = "hist", bins = 500, x = "type", y = "hot", figsize = (20, 10), color = "red")
plt.title("Número de indicadores hot vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de indicadores hot", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
conexiones_dataset["hot"].describe()
count    19534.000000
mean         0.181888
std          2.011012
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max         44.000000
Name: hot, dtype: float64
The "describe" method already indicated that the bulk of the hot-indicator counts are zero, so it is consistent that both normal and malicious connections cluster around these values. What stands out is that the outliers from values around 2 upwards all correspond to normal connections.
This is already suggested by the limits drawn in the plots up to values around 2, but the claim is verified below for the sake of rigor.
Both urgent and hot break the trend observed for the outliers analyzed up to this point: for these two variables the outliers correspond mostly to normal connections rather than malicious ones.
hot_mask = conexiones_dataset["hot"] > 2
hot_filtrado_mayor_2 = conexiones_dataset[hot_mask]
hot_filtrado_mayor_2
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | 26 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 87 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 152 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 291 | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 319 | 33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19092 | 24 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 19270 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 19337 | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 19370 | 23 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 2 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 1.0 | normal |
| 19460 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
205 rows × 46 columns
# 205 rows is too many to inspect by eye, so check programmatically whether any is an attack:
"attack" in hot_filtrado_mayor_2.values
False
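Note that `"attack" in hot_filtrado_mayor_2.values` scans every cell of all 46 columns, so the string `"attack"` appearing in any other column would produce a false positive. A safer check targets the label column directly; a sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for conexiones_dataset.
df = pd.DataFrame({
    "hot":  [0, 3, 5, 0, 4],
    "type": ["attack", "normal", "normal", "attack", "normal"],
})

filtrado = df[df["hot"] > 2]

# Checking only the "type" column avoids matching the string
# "attack" anywhere else in the frame.
hay_ataques = (filtrado["type"] == "attack").any()
print(hay_ataques)  # False: every hot > 2 row is labelled "normal"
```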
1.7.- 'num_failed_logins'
conexiones_dataset["num_failed_logins"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Número de intentos de login fallidos", fontsize = 25)
plt.xlabel("Intentos de login fallidos", fontsize = 20)
plt.ylabel("Número", fontsize = 20)
joker = conexiones_dataset[["num_failed_logins", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, x = "type", y = "num_failed_logins", figsize = (20, 10), color = "blue")
plt.title("Número de intentos de login fallidos vs Conexiones buenas", fontsize = 25)
plt.xlabel("Número de intentos de login fallidos", fontsize = 20)
plt.ylabel("Conexiones buenas", fontsize = 20)
joker_attack.plot(kind = "hist", bins = 500, x = "type", y = "num_failed_logins", figsize = (20, 10), color = "red")
plt.title("Número de intentos de login fallidos vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de intentos de login fallidos", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
conexiones_dataset["num_failed_logins"].describe()
count    19534.000000
mean         0.001075
std          0.038516
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          3.000000
Name: num_failed_logins, dtype: float64
The "describe" method already indicated that the bulk of the failed-login counts are zero, so it is consistent that both normal and malicious connections cluster around these values. What stands out among the outliers with values 1, 2 and 3 is that the values 2 and 3 correspond to normal connections, while the rows with exactly 1 failed login attempt are malicious in a far higher proportion than the dataset as a whole.
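The "far higher proportion" claim can be quantified by comparing the attack rate inside the filtered rows against the overall attack rate. A minimal sketch with fabricated counts (not the real dataset's figures):

```python
import pandas as pd

# Toy stand-in: 16 rows with one failed login (5 attacks among them)
# against an overall attack rate of 10%.
df = pd.DataFrame({
    "num_failed_logins": [1] * 16 + [0] * 84,
    "type": ["attack"] * 5 + ["normal"] * 11
          + ["attack"] * 5 + ["normal"] * 79,
})

# Baseline attack rate vs rate conditioned on num_failed_logins == 1.
base = (df["type"] == "attack").mean()
subconjunto = df.loc[df["num_failed_logins"] == 1, "type"]
tasa = (subconjunto == "attack").mean()
print(f"overall: {base:.2f}, with 1 failed login: {tasa:.2f}")
# prints "overall: 0.10, with 1 failed login: 0.31"
```

Applied to `conexiones_dataset`, the same two rates make the enrichment of attacks at `num_failed_logins == 1` explicit rather than visual.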
num_failed_logins_mask = conexiones_dataset["num_failed_logins"] == 1
num_failed_logins_filtrado_1 = conexiones_dataset[num_failed_logins_mask]
num_failed_logins_filtrado_1
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2119 | 319 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 2588 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 4058 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 4571 | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 7165 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 2 | 2 | 0.0 | 0.0 | 1.0 | 1.0 | 1.00 | 0.00 | 0.0 | attack |
| 8240 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 10050 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 10533 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 1.0 | 1.0 | 1.00 | 0.00 | 0.0 | attack |
| 13342 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 1.0 | 1.0 | 1.00 | 0.00 | 0.0 | normal |
| 13707 | 172 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 13 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.08 | 0.15 | 0.0 | normal |
| 13812 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | attack |
| 16171 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 16905 | 466 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 17062 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.00 | 0.0 | normal |
| 17407 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 19530 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | attack |
16 rows × 46 columns
num_failed_logins_mask = conexiones_dataset["num_failed_logins"] == 2
num_failed_logins_filtrado_2 = conexiones_dataset[num_failed_logins_mask]
num_failed_logins_filtrado_2
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18467 | 1049 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | normal |
1 rows × 46 columns
num_failed_logins_mask = conexiones_dataset["num_failed_logins"] == 3
num_failed_logins_filtrado_3 = conexiones_dataset[num_failed_logins_mask]
num_failed_logins_filtrado_3
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15110 | 4746 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | normal |
1 rows × 46 columns
1.8.- 'num_compromised'
conexiones_dataset["num_compromised"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Número de condiciones comprometidas", fontsize = 25)
plt.xlabel("Condiciones comprometidas", fontsize = 20)
plt.ylabel("Número", fontsize = 20)
Text(0, 0.5, 'Número')
joker = conexiones_dataset[["num_compromised", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, y = "num_compromised", figsize = (20, 10), color = "blue")
plt.title("Número de condiciones comprometidas vs Conexiones buenas", fontsize = 25)
plt.xlabel("Número de condiciones comprometidas", fontsize = 20)
plt.ylabel("Conexiones buenas", fontsize = 20)
Text(0, 0.5, 'Conexiones buenas')
joker_attack.plot(kind = "hist", bins = 500, y = "num_compromised", figsize = (20, 10), color = "red")
plt.title("Número de condiciones comprometidas vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de condiciones comprometidas", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
Text(0, 0.5, 'Conexiones maliciosas')
conexiones_dataset["num_compromised"].describe()
count    19534.000000
mean         0.778284
std         56.368063
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max       7479.000000
Name: num_compromised, dtype: float64
El método "describe" ya apuntaba que el grueso de los valores del número de condiciones comprometidas de la conexión son cero, por lo que es lógico que el grueso de las conexiones buenas y maliciosas se encuentren también en torno a estos valores. Llama la atención el valor extremo máximo de 7479 condiciones comprometidas y que los outliers a partir de valores en torno a 2 correspondan todos a conexiones buenas (muy similar a lo que sucedía con el número de indicadores hot a partir también de 2).
Se aprecia en los límites marcados por la gráfica de fallidas hasta valores 2, pero se pasa a comprobar esta afirmación por rigor y veracidad.
num_compromised_mask = conexiones_dataset["num_compromised"] > 2
num_compromised_filtrado_mayor_2 = conexiones_dataset[num_compromised_mask]
num_compromised_filtrado_mayor_2
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1305 | 2514 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 1437 | 395 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 1892 | 571 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 1899 | 762 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 2311 | 57 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 2725 | 16754 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 2738 | 642 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 3292 | 507 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 3295 | 310 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 3316 | 593 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 4005 | 580 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 4090 | 346 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 4149 | 493 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 5175 | 14362 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 5722 | 32 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 5949 | 15509 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 6138 | 9305 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 6597 | 14943 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 7006 | 6869 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 7053 | 11773 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 7190 | 15883 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 151 | 1 | 0.99 | 0.0 | 0.0 | 0.0 | 0.01 | 0.07 | 0.0 | normal |
| 7565 | 832 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 2 | 2 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 7635 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 8402 | 694 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 10000 | 914 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 10253 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 13 | 13 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 10818 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 1.00 | 1.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 10849 | 560 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 10872 | 140 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 10993 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 12802 | 11565 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 13607 | 583 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 13610 | 658 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 13778 | 14517 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 14026 | 14857 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 14456 | 13193 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 15110 | 4746 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 15253 | 16187 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 15480 | 2552 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 15932 | 729 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 16602 | 18848 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 16742 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 1.00 | 1.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 16886 | 322 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 17309 | 5328 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 17433 | 842 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 17543 | 314 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 18171 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 1.00 | 1.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
| 18986 | 12988 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | normal |
48 rows × 46 columns
"attack" in num_compromised_filtrado_mayor_2.values
False
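La comprobación `"attack" in df.values` funciona, pero busca la cadena en todas las columnas del DataFrame. Un boceto alternativo (sobre datos sintéticos hipotéticos, no los del dataset) restringe la búsqueda a la columna `type`, lo que evita falsos positivos si otra columna llegara a contener esa cadena:

```python
import pandas as pd

# DataFrame de ejemplo (hipotético): la columna "type" etiqueta cada conexión.
df = pd.DataFrame({
    "num_compromised": [0, 0, 3, 5, 7479],
    "type": ["attack", "normal", "normal", "normal", "normal"],
})

# Equivalente a `"attack" in df.values`, pero solo sobre la columna "type".
mask = df["num_compromised"] > 2
hay_ataques = (df.loc[mask, "type"] == "attack").any()
print(hay_ataques)  # False: todos los outliers filtrados son "normal"
```

Además de ser más preciso, `(…== "attack").any()` devuelve directamente un booleano sin materializar el array completo de valores.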
1.9.- 'num_root'
conexiones_dataset["num_root"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Número de accesos raíz", fontsize = 25)
plt.xlabel("Accesos raíz", fontsize = 20)
plt.ylabel("Número", fontsize = 20)
Text(0, 0.5, 'Número')
joker = conexiones_dataset[["num_root", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, y = "num_root", figsize = (20, 10), color = "blue")
plt.title("Número de accesos raíz vs Conexiones buenas", fontsize = 25)
plt.xlabel("Número de accesos raíz", fontsize = 20)
plt.ylabel("Conexiones buenas", fontsize = 20)
Text(0, 0.5, 'Conexiones buenas')
joker_attack.plot(kind = "hist", bins = 500, y = "num_root", figsize = (20, 10), color = "red")
plt.title("Número de accesos raíz vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de accesos raíz", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
Text(0, 0.5, 'Conexiones maliciosas')
conexiones_dataset["num_root"].describe()
count    19534.000000
mean         0.824767
std         56.620991
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max       7468.000000
Name: num_root, dtype: float64
El método "describe" ya apuntaba que el grueso de los valores del número de accesos raíz de la conexión son cero, por lo que es lógico que el grueso de las conexiones buenas y maliciosas se encuentren también en torno a estos valores. Llama la atención el valor extremo máximo de 7468 accesos raíz y que los outliers a partir de valores en torno a 1 correspondan todos a conexiones buenas (muy similar a lo que sucedía con el número de indicadores hot y condiciones comprometidas a partir de 2).
num_root_mask = conexiones_dataset["num_root"] > 1
num_root_filtrado_mayor_1 = conexiones_dataset[num_root_mask]
"attack" in num_root_filtrado_mayor_1.values
False
1.10.- 'num_file_creations'
conexiones_dataset["num_file_creations"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Número de operaciones de creación de fichero", fontsize = 25)
plt.xlabel("Operaciones de creación de fichero", fontsize = 20)
plt.ylabel("Número", fontsize = 20)
Text(0, 0.5, 'Número')
joker = conexiones_dataset[["num_file_creations", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, y = "num_file_creations", figsize = (20, 10), color = "blue")
plt.title("Número de operaciones de creación de fichero vs Conexiones buenas", fontsize = 25)
plt.xlabel("Número de operaciones de creación de fichero", fontsize = 20)
plt.ylabel("Conexiones buenas", fontsize = 20)
Text(0, 0.5, 'Conexiones buenas')
joker_attack.plot(kind = "hist", bins = 500, y = "num_file_creations", figsize = (20, 10), color = "red")
plt.title("Número de operaciones de creación de fichero vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de operaciones de creación de fichero", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
Text(0, 0.5, 'Conexiones maliciosas')
conexiones_dataset["num_file_creations"].describe()
count    19534.000000
mean         0.017764
std          0.600552
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max         43.000000
Name: num_file_creations, dtype: float64
El método "describe" ya apuntaba que el grueso de los valores del número de operaciones de creación de fichero de la conexión son cero, por lo que es lógico que el grueso de las conexiones buenas y maliciosas se encuentren también en torno a estos valores. Llama la atención que los outliers correspondan todos a conexiones buenas (muy similar a lo que sucedía con el número de indicadores hot, condiciones comprometidas y accesos raíz).
num_file_creations_mask = conexiones_dataset["num_file_creations"] > 1
num_file_creations_filtrado_mayor_1 = conexiones_dataset[num_file_creations_mask]
"attack" in num_file_creations_filtrado_mayor_1.values
False
1.11.- 'num_shells'
conexiones_dataset["num_shells"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Número de shell prompts", fontsize = 25)
plt.xlabel("Shell prompts", fontsize = 20)
plt.ylabel("Número", fontsize = 20)
Text(0, 0.5, 'Número')
joker = conexiones_dataset[["num_shells", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, y = "num_shells", figsize = (20, 10), color = "blue")
plt.title("Número de shell prompts vs Conexiones buenas", fontsize = 25)
plt.xlabel("Número de shell prompts", fontsize = 20)
plt.ylabel("Conexiones buenas", fontsize = 20)
Text(0, 0.5, 'Conexiones buenas')
joker_attack.plot(kind = "hist", bins = 500, y = "num_shells", figsize = (20, 10), color = "red")
plt.title("Número de shell prompts vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de shell prompts", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
Text(0, 0.5, 'Conexiones maliciosas')
conexiones_dataset["num_shells"].describe()
count    19534.000000
mean         0.000307
std          0.017524
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: num_shells, dtype: float64
El método "describe" ya apuntaba que el grueso de los valores del número de shell prompts de la conexión son cero, por lo que es lógico que el grueso de las conexiones buenas y maliciosas se encuentren también en torno a estos valores. Llama la atención que los outliers correspondan todos a conexiones buenas (muy similar a lo que sucedía con el número de indicadores hot, condiciones comprometidas, accesos raíz y operaciones de creación de fichero).
num_shells_mask = conexiones_dataset["num_shells"] > 1
num_shells_filtrado_mayor_1 = conexiones_dataset[num_shells_mask]
"attack" in num_shells_filtrado_mayor_1.values
False
1.12.- 'num_access_files'
conexiones_dataset["num_access_files"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Número de access files", fontsize = 25)
plt.xlabel("Access files", fontsize = 20)
plt.ylabel("Número", fontsize = 20)
Text(0, 0.5, 'Número')
joker = conexiones_dataset[["num_access_files", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, y = "num_access_files", figsize = (20, 10), color = "blue")
plt.title("Número de access files vs Conexiones buenas", fontsize = 25)
plt.xlabel("Número de access files", fontsize = 20)
plt.ylabel("Conexiones buenas", fontsize = 20)
Text(0, 0.5, 'Conexiones buenas')
joker_attack.plot(kind = "hist", bins = 500, y = "num_access_files", figsize = (20, 10), color = "red")
plt.title("Número de access files vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de access files", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
Text(0, 0.5, 'Conexiones maliciosas')
conexiones_dataset["num_access_files"].describe()
count    19534.000000
mean         0.007935
std          0.152081
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          9.000000
Name: num_access_files, dtype: float64
El método "describe" ya apuntaba que el grueso de los valores del número de access files de la conexión son cero, por lo que es lógico que el grueso de las conexiones buenas y maliciosas se encuentren también en torno a estos valores. Llama la atención que los outliers mayores a 1 access file correspondan todos a conexiones buenas (muy similar a lo que sucedía con el número de indicadores hot, condiciones comprometidas, accesos raíz, operaciones de creación de fichero y shell prompts).
num_access_files_mask = conexiones_dataset["num_access_files"] == 1
num_access_files_filtrado_1 = conexiones_dataset[num_access_files_mask]
num_access_files_filtrado_1 # buenas y maliciosas: proporción similar a la del conjunto.
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 392 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 6 | 6 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 467 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 16 | 16 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 597 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 695 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | normal |
| 1384 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 11 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.18 | normal |
| 1813 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 1945 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 2063 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 24 | 24 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 2235 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 2682 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 2725 | 16754 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 2792 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | normal |
| 2896 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 3055 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 5 | 0.00 | 0.0 | 0.00 | 0.20 | 1.00 | 0.00 | 0.40 | normal |
| 3292 | 507 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 3295 | 310 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 3348 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 3360 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 10 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.20 | normal |
| 3368 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 3480 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 3518 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 5 | 5 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 3681 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 3878 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 4090 | 346 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 4988 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | attack |
| 4998 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 7 | 11 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.18 | normal |
| 5009 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 5873 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 5949 | 15509 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 6115 | 12123 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 6339 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 6356 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 6420 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 6512 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 7 | 6 | 0.00 | 0.0 | 0.00 | 0.00 | 0.86 | 0.29 | 0.00 | normal |
| 6538 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 7190 | 15883 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 151 | 1 | 0.99 | 0.0 | 0.00 | 0.00 | 0.01 | 0.07 | 0.00 | normal |
| 7582 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 7710 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 7985 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 8138 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 5 | 5 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 8285 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 5 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.40 | normal |
| 8528 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 9080 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 9171 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 9264 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 9755 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 9863 | 2670 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 10666 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 5 | 0.00 | 0.2 | 0.00 | 0.00 | 1.00 | 0.00 | 0.60 | normal |
| 10919 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 11059 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 11135 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 11520 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 4 | 0.00 | 0.0 | 0.00 | 0.25 | 1.00 | 0.00 | 0.50 | normal |
| 11690 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 4 | 0.00 | 0.0 | 0.00 | 0.25 | 1.00 | 0.00 | 0.50 | normal |
| 11729 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 12206 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 12537 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 12711 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | attack |
| 12847 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 12848 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 5 | 9 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.33 | normal |
| 12937 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 13031 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 4 | 0.00 | 0.0 | 0.00 | 0.25 | 1.00 | 0.00 | 0.50 | normal |
| 13406 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 13610 | 658 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 13737 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 14353 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | normal |
| 14670 | 799 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 2 | 0.00 | 0.0 | 0.67 | 1.00 | 0.67 | 0.67 | 0.00 | normal |
| 14681 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | normal |
| 14941 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 3 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | normal |
| 14946 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 15055 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | normal |
| 15115 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 15480 | 2552 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 16106 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 16440 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 9 | 0.00 | 0.0 | 0.00 | 0.11 | 1.00 | 0.00 | 0.22 | normal |
| 16584 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 16855 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 16962 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 7 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.29 | normal |
| 16966 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 17347 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 17536 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 6 | 6 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 17633 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 17665 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 17800 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 17905 | 160 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 5 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.60 | normal |
| 17974 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 18048 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 5 | 5 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 18169 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 3 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | normal |
| 18381 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 9 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.22 | normal |
| 18430 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 5 | 0.00 | 0.0 | 0.00 | 0.20 | 1.00 | 0.00 | 0.40 | normal |
| 18723 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 3 | 0.00 | 0.0 | 0.00 | 0.33 | 1.00 | 0.00 | 0.67 | normal |
| 18802 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 4 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 19493 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
| 19513 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | normal |
93 rows × 46 columns
num_access_files_mask = conexiones_dataset["num_access_files"] > 1
num_access_files_filtrado_mayor_1 = conexiones_dataset[num_access_files_mask]
"attack" in num_access_files_filtrado_mayor_1.values
False
1.13.- 'count'
conexiones_dataset["count"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Número de conexiones count", fontsize = 25)
plt.xlabel("Conexiones count", fontsize = 20)
plt.ylabel("Número", fontsize = 20)
Text(0, 0.5, 'Número')
joker = conexiones_dataset[["count", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, y = "count", figsize = (20, 10), color = "blue")
plt.title("Número de conexiones count vs Conexiones buenas", fontsize = 25)
plt.xlabel("Número de conexiones count", fontsize = 20)
plt.ylabel("Conexiones buenas", fontsize = 20)
Text(0, 0.5, 'Conexiones buenas')
joker_attack.plot(kind = "hist", bins = 500, y = "count", figsize = (20, 10), color = "red")
plt.title("Número de conexiones count vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de conexiones count", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
Text(0, 0.5, 'Conexiones maliciosas')
conexiones_dataset["count"].describe()
count    19534.000000
mean        32.808181
std         75.561680
min          0.000000
25%          1.000000
50%          5.000000
75%         16.000000
max        511.000000
Name: count, dtype: float64
El método "describe" ya apuntaba que el grueso de los valores del número de conexiones count de la conexión estaba entre 1 y 16, por lo que es lógico que el grueso de las conexiones buenas y maliciosas se encuentren también en torno a estos valores. Llama la atención que en lo referente a los outliers la correspondencia de éstos con conexiones maliciosas se presenta en una proporción mucho mayor que la del conjunto.
1.14.- 'srv_count'
conexiones_dataset["srv_count"].plot(kind = "box", figsize = (20, 10), color = "green")
plt.title("Número de conexiones srv count", fontsize = 25)
plt.xlabel("Conexiones srv count", fontsize = 20)
plt.ylabel("Número", fontsize = 20)
Text(0, 0.5, 'Número')
joker = conexiones_dataset[["srv_count", "type"]]
joker_normal = joker.loc[joker["type"] == "normal"]
joker_attack = joker.loc[joker["type"] == "attack"]
joker_normal.plot(kind = "hist", bins = 500, y = "srv_count", figsize = (20, 10), color = "blue")
plt.title("Número de conexiones srv count vs Conexiones buenas", fontsize = 25)
plt.xlabel("Número de conexiones srv count", fontsize = 20)
plt.ylabel("Conexiones buenas", fontsize = 20)
Text(0, 0.5, 'Conexiones buenas')
joker_attack.plot(kind = "hist", bins = 500, y = "srv_count", figsize = (20, 10), color = "red")
plt.title("Número de conexiones srv count vs Conexiones maliciosas", fontsize = 25)
plt.xlabel("Número de conexiones srv count", fontsize = 20)
plt.ylabel("Conexiones maliciosas", fontsize = 20)
Text(0, 0.5, 'Conexiones maliciosas')
conexiones_dataset["srv_count"].describe()
count    19534.000000
mean        29.523446
std         69.005003
min          0.000000
25%          2.000000
50%          6.000000
75%         18.000000
max        511.000000
Name: srv_count, dtype: float64
El método "describe" ya apuntaba que el grueso de los valores del número de conexiones srv count de la conexión estaba entre 2 y 18, por lo que es lógico que el grueso de las conexiones buenas y maliciosas se encuentren también en torno a estos valores. Llama la atención que en lo referente a los outliers la correspondencia de éstos con conexiones maliciosas se presenta de manera más patente, a diferencia de para el número de conexiones count, sólo a partir de valores en torno a 300 conexiones en una proporción mucho mayor que la del conjunto (para el número de conexiones srv count era más patente para todo el rango de outliers).
2.- Valores 0 o 1 (y 2, en el caso del apartado 2.7.- 'su_attempted')
2.1.- 'protocol_type': tcp, udp e icmp
sns.countplot(x='protocol_type__tcp',
hue = 'type',
data=conexiones_dataset)
sns.countplot(x='protocol_type__udp',
hue = 'type',
data=conexiones_dataset)
sns.countplot(x='protocol_type__icmp',
hue = 'type',
data=conexiones_dataset)
2.2.- 'service': http, private, domain_u, smtp, ftp_data, telnet, ftp and other.
It is striking that for service__http = 0 the proportion of malicious connections is higher than in the dataset as a whole.
sns.countplot(x='service__http',
hue = 'type',
data=conexiones_dataset)
It is striking that for service__private = 1 the proportion of malicious connections is higher than in the dataset as a whole.
sns.countplot(x='service__private',
hue = 'type',
data=conexiones_dataset)
sns.countplot(x='service__domain_u',
hue = 'type',
data=conexiones_dataset)
sns.countplot(x='service__smtp',
hue = 'type',
data=conexiones_dataset)
sns.countplot(x='service__ftp_data',
hue = 'type',
data=conexiones_dataset)
sns.countplot(x='service__telnet',
hue = 'type',
data=conexiones_dataset)
sns.countplot(x='service__ftp',
hue = 'type',
data=conexiones_dataset)
pd.value_counts(conexiones_dataset["service__ftp"])
0    19299
1      235
Name: service__ftp, dtype: int64
service__ftp_mask = conexiones_dataset["service__ftp"] == 1
service__ftp_filtrado_1 = conexiones_dataset[service__ftp_mask]
"attack" in service__ftp_filtrado_1.values
True
sns.countplot(x='service__other',
hue = 'type',
data=conexiones_dataset)
It is striking that for service__other = 1 the proportion of malicious connections is higher than in the dataset as a whole.
2.3.- 'flag': OTH (the remaining flags contain only zeros; OTH is the only flag with both 0 and 1 values).
sns.countplot(x='flag__OTH',
hue = 'type',
data=conexiones_dataset)
pd.value_counts(conexiones_dataset["flag__OTH"])
0    19399
1      135
Name: flag__OTH, dtype: int64
flag__OTH_mask = conexiones_dataset["flag__OTH"] == 1
flag__OTH_filtrado_1 = conexiones_dataset[flag__OTH_mask]
"attack" in flag__OTH_filtrado_1.values
True
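The membership check above only confirms that at least one attack appears among the filtered rows; computing the attack share inside the subgroup and comparing it with the overall share is more informative. A minimal sketch on made-up data (the column names mirror the notebook's, the values do not):

```python
import pandas as pd

# Toy frame mimicking the notebook's columns (values are illustrative).
df = pd.DataFrame({
    "flag__OTH": [1, 1, 1, 0, 0, 0, 0, 0],
    "type":      ["attack", "attack", "normal",
                  "normal", "normal", "normal", "normal", "attack"],
})

# Share of attacks overall vs. within the flag__OTH == 1 subgroup.
overall = (df["type"] == "attack").mean()
subgroup = (df.loc[df["flag__OTH"] == 1, "type"] == "attack").mean()
print(overall)   # 0.375
print(subgroup)  # ~0.667: attacks over-represented in the subgroup
```

If the subgroup share is well above the overall share, the binary feature is a promising predictor.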
2.4.- 'land'
sns.countplot(x='land',
hue = 'type',
data=conexiones_dataset)
pd.value_counts(conexiones_dataset["land"])
0    19532
1        2
Name: land, dtype: int64
2.5.- 'logged_in'
It is striking that for logged_in = 0 the proportion of malicious connections is higher than in the dataset as a whole.
sns.countplot(x='logged_in',
hue = 'type',
data=conexiones_dataset)
2.6.- 'root_shell'
sns.countplot(x='root_shell',
hue = 'type',
data=conexiones_dataset)
pd.value_counts(conexiones_dataset["root_shell"])
0    19494
1       40
Name: root_shell, dtype: int64
2.7.- 'su_attempted'
sns.countplot(x='su_attempted',
hue = 'type',
data=conexiones_dataset)
pd.value_counts(conexiones_dataset["su_attempted"])
0    19515
2       15
1        4
Name: su_attempted, dtype: int64
According to the official dataset documentation, su_attempted takes the value 1 if an "su root" command was attempted and 0 otherwise. Taking that as authoritative, the 15 values of 2 would therefore be erroneous. There is no criterion to decide whether those 2s correspond to zeros, ones, or some mix of both, so the decision is made not to replace them. Moreover, future data might arrive with value 2 for some reason, so they are left as they are.
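Since the decision is to keep the anomalous 2s rather than impute them, a defensive check can at least surface out-of-domain values whenever new data arrives. A sketch (the `report_unexpected` helper is hypothetical, not part of the notebook):

```python
import pandas as pd

def report_unexpected(series: pd.Series, allowed: set) -> pd.Series:
    """Return value counts of entries outside the documented domain."""
    mask = ~series.isin(allowed)
    return series[mask].value_counts()

# Toy series mimicking su_attempted's documented {0, 1} domain plus stray 2s.
s = pd.Series([0] * 10 + [1, 1] + [2] * 3, name="su_attempted")
unexpected = report_unexpected(s, allowed={0, 1})
print(unexpected)  # value 2 appears 3 times
```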
su_attempted_mask = conexiones_dataset["su_attempted"] == 2
su_attempted_filtrado_2 = conexiones_dataset[su_attempted_mask]
"attack" in su_attempted_filtrado_2.values
False
2.8.- 'is_guest_login'
sns.countplot(x='is_guest_login',
hue = 'type',
data=conexiones_dataset)
pd.value_counts(conexiones_dataset["is_guest_login"])
0    19321
1      213
Name: is_guest_login, dtype: int64
is_guest_login_mask = conexiones_dataset["is_guest_login"] == 1
is_guest_login_filtrado_1 = conexiones_dataset[is_guest_login_mask]
"attack" in is_guest_login_filtrado_1.values
True
3.- Percentages expressed as fractions (0 to 1).
3.1.- 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate'.
columnas_4 = ['serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'type']
sns.pairplot(conexiones_dataset[columnas_4], hue = "type", height = 4, diag_kind = "hist")
3.2.- 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate'.
columnas_3 = ['same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'type']
sns.pairplot(conexiones_dataset[columnas_3], hue = "type", height = 5, diag_kind = "hist")
Correlation matrix: we start with a first numerical approach to its analysis, followed by a visual one:
conexiones_dataset[features].corr()
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | is_guest_login | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| duration | 1.000000 | -0.092964 | 0.105439 | -0.018720 | -0.126718 | 0.025058 | -0.041303 | -0.036581 | -0.029809 | 0.201684 | ... | -0.009266 | -0.046560 | -0.047386 | -0.022574 | -0.024643 | 0.101936 | 0.108332 | -0.020321 | 0.061776 | -0.052632 |
| protocol_type__tcp | -0.092964 | 1.000000 | -0.921733 | -0.318005 | 0.552789 | -0.178683 | -0.737212 | 0.160307 | 0.132226 | 0.058699 | ... | 0.052961 | -0.395467 | -0.573619 | 0.114647 | 0.115143 | 0.119030 | 0.120031 | 0.047180 | -0.144593 | 0.072514 |
| protocol_type__udp | 0.105439 | -0.921733 | 1.000000 | -0.074578 | -0.509523 | 0.202437 | 0.799811 | -0.147760 | -0.121877 | -0.054105 | ... | -0.048816 | 0.401991 | 0.582976 | -0.105991 | -0.106131 | -0.109688 | -0.110637 | -0.003905 | 0.080840 | -0.083474 |
| protocol_type__icmp | -0.018720 | -0.318005 | -0.074578 | 1.000000 | -0.175790 | -0.035437 | -0.059648 | -0.050978 | -0.042048 | -0.018667 | ... | -0.016842 | 0.034144 | 0.049787 | -0.035682 | -0.036616 | -0.037915 | -0.038171 | -0.111768 | 0.174167 | 0.017607 |
| service__http | -0.126718 | 0.552789 | -0.509523 | -0.175790 | 1.000000 | -0.242107 | -0.407523 | -0.348291 | -0.287279 | -0.127532 | ... | -0.115067 | -0.341299 | -0.275223 | -0.172894 | -0.172822 | 0.038931 | 0.039045 | 0.322926 | -0.235061 | -0.021911 |
| service__private | 0.025058 | -0.178683 | 0.202437 | -0.035437 | -0.242107 | 1.000000 | -0.082151 | -0.070210 | -0.057911 | -0.025709 | ... | -0.023196 | 0.295328 | 0.103346 | 0.300268 | 0.302540 | 0.099937 | 0.101022 | -0.410885 | 0.127412 | -0.101217 |
| service__domain_u | -0.041303 | -0.737212 | 0.799811 | -0.059648 | -0.407523 | -0.082151 | 1.000000 | -0.118180 | -0.097478 | -0.043274 | ... | -0.039044 | 0.367965 | 0.569801 | -0.086190 | -0.084885 | -0.087897 | -0.088489 | 0.092971 | -0.039302 | -0.020544 |
| service__smtp | -0.036581 | 0.160307 | -0.147760 | -0.050978 | -0.348291 | -0.070210 | -0.118180 | 1.000000 | -0.083310 | -0.036984 | ... | -0.033369 | -0.129546 | -0.128223 | -0.045661 | -0.048513 | -0.068204 | -0.069536 | 0.065150 | -0.004524 | 0.342565 |
| service__ftp_data | -0.029809 | 0.132226 | -0.121877 | -0.042048 | -0.287279 | -0.057911 | -0.097478 | -0.083310 | 1.000000 | -0.030505 | ... | -0.027524 | -0.085534 | -0.093477 | -0.026518 | -0.028451 | -0.058830 | -0.059512 | 0.032636 | 0.009469 | -0.083979 |
| service__telnet | 0.201684 | 0.058699 | -0.054105 | -0.018667 | -0.127532 | -0.025709 | -0.043274 | -0.036984 | -0.030505 | 1.000000 | ... | -0.012219 | -0.032106 | -0.045952 | 0.068024 | 0.069161 | 0.017690 | 0.016555 | -0.015845 | 0.009854 | -0.045069 |
| service__ftp | -0.010020 | 0.055661 | -0.051305 | -0.017700 | -0.120932 | -0.024378 | -0.041034 | -0.035070 | -0.028927 | -0.012841 | ... | 0.951499 | -0.036553 | -0.044927 | 0.002297 | 0.003030 | -0.013154 | -0.013356 | 0.010205 | 0.001813 | -0.041748 |
| service__other | 0.219035 | -0.288103 | 0.111355 | 0.468577 | -0.375156 | -0.075626 | -0.127296 | -0.108794 | -0.089736 | -0.039837 | ... | -0.035943 | 0.174813 | -0.002490 | 0.204358 | 0.204834 | 0.071145 | 0.073143 | -0.426597 | 0.328622 | -0.096657 |
| src_bytes | 0.012285 | 0.023974 | -0.022271 | -0.007201 | -0.050803 | -0.010618 | -0.017866 | -0.011605 | 0.133422 | -0.005318 | ... | -0.004671 | -0.019484 | -0.018973 | -0.007936 | -0.007634 | -0.010515 | -0.010484 | 0.010992 | -0.000419 | -0.010376 |
| dst_bytes | 0.035365 | 0.034260 | -0.031499 | -0.011090 | 0.007600 | -0.015101 | -0.025175 | -0.019985 | 0.000486 | 0.156551 | ... | -0.003039 | -0.025287 | -0.022726 | 0.013419 | 0.012781 | -0.014952 | -0.014668 | 0.019911 | -0.013734 | -0.010471 |
| flag__SF | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| flag__S0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| flag__REJ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| flag__RSTR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| flag__RSTO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| flag__OTH | -0.005103 | 0.042079 | -0.038785 | -0.013381 | -0.016959 | -0.003761 | -0.031021 | 0.009880 | 0.008363 | 0.092549 | ... | 0.020992 | -0.032991 | -0.031492 | 0.290490 | 0.269596 | -0.017705 | -0.017502 | 0.005720 | 0.012147 | -0.004116 |
| land | -0.001181 | 0.005104 | -0.004705 | -0.001623 | -0.011090 | -0.002236 | -0.003763 | -0.003216 | -0.002653 | -0.001178 | ... | -0.001062 | -0.004260 | -0.004036 | 0.047841 | 0.048367 | -0.002392 | -0.002408 | 0.003074 | -0.002214 | 0.033331 |
| wrong_fragment | -0.004498 | -0.076406 | 0.067384 | 0.031733 | -0.042236 | 0.146304 | -0.014331 | -0.012248 | -0.010103 | -0.004485 | ... | -0.004047 | 0.008958 | 0.009568 | -0.004327 | -0.008798 | -0.009042 | -0.009171 | 0.007601 | -0.007639 | -0.009297 |
| urgent | 0.007078 | 0.005104 | -0.004705 | -0.001623 | -0.011090 | -0.002236 | -0.003763 | -0.003216 | -0.002653 | 0.086955 | ... | -0.001062 | -0.004260 | -0.004183 | -0.002352 | -0.002310 | -0.002392 | -0.002408 | 0.003074 | -0.002214 | -0.004636 |
| hot | 0.000634 | 0.045623 | -0.042052 | -0.014508 | -0.092783 | -0.019982 | -0.033634 | -0.027599 | -0.023399 | 0.003443 | ... | 0.809964 | -0.037860 | -0.037110 | -0.018322 | -0.018116 | -0.019900 | -0.019693 | 0.020930 | -0.001305 | -0.034267 |
| num_failed_logins | 0.011642 | 0.014079 | -0.012977 | -0.004477 | -0.030589 | -0.006166 | -0.010379 | -0.008871 | -0.007317 | 0.135668 | ... | 0.022666 | -0.011504 | -0.011519 | -0.006489 | -0.006372 | 0.011700 | 0.011619 | 0.000435 | 0.004110 | -0.012788 |
| logged_in | -0.103078 | 0.717336 | -0.661192 | -0.228116 | 0.612192 | -0.314175 | -0.528829 | 0.212168 | -0.030802 | 0.008864 | ... | 0.070688 | -0.468558 | -0.395707 | -0.252270 | -0.253983 | -0.308280 | -0.302235 | 0.362720 | -0.196960 | 0.108034 |
| num_compromised | 0.068849 | 0.006965 | -0.006420 | -0.002215 | -0.015068 | -0.003050 | -0.005134 | -0.004382 | -0.003619 | 0.118334 | ... | -0.001450 | -0.005744 | -0.005706 | 0.004974 | 0.004944 | -0.003236 | -0.003248 | 0.004050 | -0.002997 | -0.006308 |
| root_shell | 0.135206 | 0.022849 | -0.021061 | -0.007266 | -0.008703 | -0.010007 | -0.016844 | -0.014396 | -0.011874 | 0.201855 | ... | -0.004756 | -0.016296 | -0.015491 | 0.001097 | 0.001343 | -0.010707 | -0.010779 | 0.013760 | -0.009910 | -0.018374 |
| su_attempted | 0.200005 | 0.015345 | -0.014144 | -0.004880 | -0.033340 | -0.006721 | -0.011313 | -0.009669 | -0.007975 | 0.261426 | ... | -0.003194 | -0.012807 | -0.012576 | 0.010681 | 0.010980 | -0.007191 | -0.007239 | 0.009242 | -0.006656 | -0.013938 |
| num_root | 0.072834 | 0.007348 | -0.006773 | -0.002337 | -0.015964 | -0.003218 | -0.005417 | -0.004611 | -0.001765 | 0.120695 | ... | -0.001529 | -0.006064 | -0.006018 | 0.004807 | 0.004789 | -0.003443 | -0.003466 | 0.004201 | -0.002931 | -0.006580 |
| num_file_creations | 0.203931 | 0.014921 | -0.013753 | -0.004745 | -0.032417 | -0.006535 | -0.011000 | -0.002313 | -0.000108 | 0.217807 | ... | -0.003106 | -0.011939 | -0.012043 | -0.004263 | -0.005045 | -0.006601 | -0.006649 | 0.007300 | -0.005771 | -0.009254 |
| num_shells | -0.002046 | 0.008842 | -0.008150 | -0.002812 | -0.019210 | -0.003872 | -0.006518 | -0.005571 | 0.054957 | -0.002040 | ... | -0.001840 | -0.005407 | -0.005298 | -0.001611 | -0.004001 | -0.004143 | -0.004171 | -0.005010 | 0.009248 | -0.008031 |
| num_access_files | 0.175649 | 0.026319 | -0.024259 | -0.008369 | -0.017293 | -0.011527 | -0.019402 | 0.014905 | -0.008188 | 0.184498 | ... | -0.005478 | -0.020588 | -0.020431 | -0.010477 | -0.011573 | -0.011298 | -0.008423 | 0.013755 | -0.009098 | 0.000332 |
| num_outbound_cmds | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| is_host_login | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| is_guest_login | -0.009266 | 0.052961 | -0.048816 | -0.016842 | -0.115067 | -0.023196 | -0.039044 | -0.033369 | -0.027524 | -0.012219 | ... | 1.000000 | -0.044089 | -0.043330 | -0.012183 | -0.011624 | -0.024818 | -0.024985 | 0.022252 | -0.000435 | -0.038858 |
| count | -0.046560 | -0.395467 | 0.401991 | 0.034144 | -0.341299 | 0.295328 | 0.367965 | -0.129546 | -0.085534 | -0.032106 | ... | -0.044089 | 1.000000 | 0.785925 | 0.299917 | 0.297688 | 0.060233 | 0.062685 | -0.355170 | 0.059312 | -0.173532 |
| srv_count | -0.047386 | -0.573619 | 0.582976 | 0.049787 | -0.275223 | 0.103346 | 0.569801 | -0.128223 | -0.093477 | -0.045952 | ... | -0.043330 | 0.785925 | 1.000000 | -0.069271 | -0.067794 | -0.087381 | -0.087952 | 0.099415 | -0.080458 | -0.142354 |
| serror_rate | -0.022574 | 0.114647 | -0.105991 | -0.035682 | -0.172894 | 0.300268 | -0.086190 | -0.045661 | -0.026518 | 0.068024 | ... | -0.012183 | 0.299917 | -0.069271 | 1.000000 | 0.973353 | -0.048024 | -0.048218 | -0.614157 | 0.059548 | -0.072965 |
| srv_serror_rate | -0.024643 | 0.115143 | -0.106131 | -0.036616 | -0.172822 | 0.302540 | -0.084885 | -0.048513 | -0.028451 | 0.069161 | ... | -0.011624 | 0.297688 | -0.067794 | 0.973353 | 1.000000 | -0.042876 | -0.050734 | -0.605124 | 0.055649 | -0.078709 |
| rerror_rate | 0.101936 | 0.119030 | -0.109688 | -0.037915 | 0.038931 | 0.099937 | -0.087897 | -0.068204 | -0.058830 | 0.017690 | ... | -0.024818 | 0.060233 | -0.087381 | -0.048024 | -0.042876 | 1.000000 | 0.984429 | -0.153387 | 0.081554 | 0.029832 |
| srv_rerror_rate | 0.108332 | 0.120031 | -0.110637 | -0.038171 | 0.039045 | 0.101022 | -0.088489 | -0.069536 | -0.059512 | 0.016555 | ... | -0.024985 | 0.062685 | -0.087952 | -0.048218 | -0.050734 | 0.984429 | 1.000000 | -0.156283 | 0.085347 | 0.027880 |
| same_srv_rate | -0.020321 | 0.047180 | -0.003905 | -0.111768 | 0.322926 | -0.410885 | 0.092971 | 0.065150 | 0.032636 | -0.015845 | ... | 0.022252 | -0.355170 | 0.099415 | -0.614157 | -0.605124 | -0.153387 | -0.156283 | 1.000000 | -0.551025 | 0.106565 |
| diff_srv_rate | 0.061776 | -0.144593 | 0.080840 | 0.174167 | -0.235061 | 0.127412 | -0.039302 | -0.004524 | 0.009469 | 0.009854 | ... | -0.000435 | 0.059312 | -0.080458 | 0.059548 | 0.055649 | 0.081554 | 0.085347 | -0.551025 | 1.000000 | -0.040901 |
| srv_diff_host_rate | -0.052632 | 0.072514 | -0.083474 | 0.017607 | -0.021911 | -0.101217 | -0.020544 | 0.342565 | -0.083979 | -0.045069 | ... | -0.038858 | -0.173532 | -0.142354 | -0.072965 | -0.078709 | 0.029832 | 0.027880 | 0.106565 | -0.040901 | 1.000000 |
45 rows × 45 columns
Several variables stand out because their correlations could not be computed (NaN). They are analyzed next:
conexiones_dataset["flag__SF"]
0 0
1 0
2 0
3 0
4 0
..
19529 0
19530 0
19531 0
19532 0
19533 0
Name: flag__SF, Length: 19534, dtype: int64
pd.value_counts(conexiones_dataset["flag__SF"])
0    19534
Name: flag__SF, dtype: int64
The variable "flag__SF" contains only zeros. We proceed in the same way with "flag__S0", "flag__REJ", "flag__RSTR", "flag__RSTO", "num_outbound_cmds" and "is_host_login":
conexiones_dataset["flag__S0"]
0 0
1 0
2 0
3 0
4 0
..
19529 0
19530 0
19531 0
19532 0
19533 0
Name: flag__S0, Length: 19534, dtype: int64
pd.value_counts(conexiones_dataset["flag__S0"])
0    19534
Name: flag__S0, dtype: int64
conexiones_dataset["flag__REJ"]
0 0
1 0
2 0
3 0
4 0
..
19529 0
19530 0
19531 0
19532 0
19533 0
Name: flag__REJ, Length: 19534, dtype: int64
pd.value_counts(conexiones_dataset["flag__REJ"])
0    19534
Name: flag__REJ, dtype: int64
conexiones_dataset["flag__RSTR"]
0 0
1 0
2 0
3 0
4 0
..
19529 0
19530 0
19531 0
19532 0
19533 0
Name: flag__RSTR, Length: 19534, dtype: int64
pd.value_counts(conexiones_dataset["flag__RSTR"])
0    19534
Name: flag__RSTR, dtype: int64
conexiones_dataset["flag__RSTO"]
0 0
1 0
2 0
3 0
4 0
..
19529 0
19530 0
19531 0
19532 0
19533 0
Name: flag__RSTO, Length: 19534, dtype: int64
pd.value_counts(conexiones_dataset["flag__RSTO"])
0    19534
Name: flag__RSTO, dtype: int64
conexiones_dataset["num_outbound_cmds"]
0 0
1 0
2 0
3 0
4 0
..
19529 0
19530 0
19531 0
19532 0
19533 0
Name: num_outbound_cmds, Length: 19534, dtype: int64
pd.value_counts(conexiones_dataset["num_outbound_cmds"])
0    19534
Name: num_outbound_cmds, dtype: int64
conexiones_dataset["is_host_login"]
0 0
1 0
2 0
3 0
4 0
..
19529 0
19530 0
19531 0
19532 0
19533 0
Name: is_host_login, Length: 19534, dtype: int64
pd.value_counts(conexiones_dataset["is_host_login"])
0    19534
Name: is_host_login, dtype: int64
features
['duration', 'protocol_type__tcp', 'protocol_type__udp', 'protocol_type__icmp', 'service__http', 'service__private', 'service__domain_u', 'service__smtp', 'service__ftp_data', 'service__telnet', 'service__ftp', 'service__other', 'src_bytes', 'dst_bytes', 'flag__SF', 'flag__S0', 'flag__REJ', 'flag__RSTR', 'flag__RSTO', 'flag__OTH', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate']
These variables are removed from the correlation analysis:
features_corr = list()
eliminar = ['flag__SF', 'flag__S0', 'flag__REJ', 'flag__RSTR', 'flag__RSTO', 'num_outbound_cmds', 'is_host_login']
for i in features:
    if i not in eliminar:
        features_corr.append(i)
features_corr
['duration', 'protocol_type__tcp', 'protocol_type__udp', 'protocol_type__icmp', 'service__http', 'service__private', 'service__domain_u', 'service__smtp', 'service__ftp_data', 'service__telnet', 'service__ftp', 'service__other', 'src_bytes', 'dst_bytes', 'flag__OTH', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate']
corrmat = conexiones_dataset[features_corr].corr()
f, ax = plt.subplots(figsize=(17, 17))
sns.heatmap(corrmat, vmax=.8, square=True)
The heat map is a very useful visual tool for understanding the variables and their relationships. At first sight several pairs stand out: 'protocol_type__tcp' vs 'protocol_type__udp', 'protocol_type__tcp' vs 'service__domain_u', 'protocol_type__udp' vs 'service__domain_u', 'protocol_type__udp' vs 'logged_in', 'service__ftp' vs 'hot', 'service__ftp' vs 'is_guest_login', 'hot' vs 'is_guest_login', 'num_compromised' vs 'num_root', 'count' vs 'srv_count', 'serror_rate' vs 'srv_serror_rate', 'rerror_rate' vs 'srv_rerror_rate', 'serror_rate' vs 'same_srv_rate' and 'srv_serror_rate' vs 'same_srv_rate'. In these cases there appears to be a significant correlation (positive or negative). Some are in fact so strong that they could indicate multicollinearity, i.e. the variables could essentially be carrying the same information.
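The pairs listed above were read off the heat map by eye; they can also be extracted programmatically by thresholding the absolute correlations in the upper triangle of the matrix. A minimal sketch on synthetic data (the column names and the 0.9 threshold are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.05, size=200),  # strongly anti-correlated with a
    "c": rng.normal(size=200),                   # independent
})

corr = df.corr()
# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.9]
print(strong)  # ('a', 'b') with r close to -1
```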
sns.set()
columnas = ['protocol_type__tcp', 'protocol_type__udp', 'service__domain_u', 'logged_in', 'service__ftp', 'hot', 'is_guest_login', 'num_compromised', 'num_root', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'type']
sns.pairplot(conexiones_dataset[columnas],
hue = "type",
height = 2,
diag_kind = "hist")
plt.show()
Should the variables found to contain only zeros be removed? We check which values these variables take in "nuevas_conexiones.csv":
nuevas_conexiones_dataset = pd.read_csv("nuevas_conexiones.csv")
print(pd.value_counts(nuevas_conexiones_dataset["flag__SF"]))
print(pd.value_counts(nuevas_conexiones_dataset["flag__S0"]))
print(pd.value_counts(nuevas_conexiones_dataset["flag__REJ"]))
print(pd.value_counts(nuevas_conexiones_dataset["flag__RSTR"]))
print(pd.value_counts(nuevas_conexiones_dataset["flag__RSTO"]))
print(pd.value_counts(nuevas_conexiones_dataset["num_outbound_cmds"]))
print(pd.value_counts(nuevas_conexiones_dataset["is_host_login"]))
0.0    3
Name: flag__SF, dtype: int64
0.0    3
Name: flag__S0, dtype: int64
0.0    3
Name: flag__REJ, dtype: int64
0.0    3
Name: flag__RSTR, dtype: int64
0.0    3
Name: flag__RSTO, dtype: int64
0.0    3
Name: num_outbound_cmds, dtype: int64
0.0    3
Name: is_host_login, dtype: int64
For all three connections the values are also zero. This confirmation, together with the fact that constant-valued variables add no information, settles the decision to remove them:
conexiones_limpio_dataset = conexiones_dataset.drop(["flag__SF", "flag__S0", "flag__REJ", "flag__RSTR", "flag__RSTO", "num_outbound_cmds", "is_host_login"], axis =1)
conexiones_limpio_dataset_columnas_features = conexiones_limpio_dataset.loc[:, conexiones_limpio_dataset.columns != 'type']
columns_names_features = conexiones_limpio_dataset_columnas_features.columns.values
features = list(columns_names_features)
features
['duration', 'protocol_type__tcp', 'protocol_type__udp', 'protocol_type__icmp', 'service__http', 'service__private', 'service__domain_u', 'service__smtp', 'service__ftp_data', 'service__telnet', 'service__ftp', 'service__other', 'src_bytes', 'dst_bytes', 'flag__OTH', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate']
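The explicit drop list used above can also be derived automatically: a column is uninformative when it holds a single distinct value. A minimal sketch with a toy frame (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "flag__SF": [0, 0, 0, 0],   # constant -> candidate for removal
    "duration": [0, 2, 0, 5],
    "type": ["normal", "attack", "normal", "normal"],
})

# Detect and drop columns with a single unique value.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df_clean = df.drop(columns=constant_cols)
print(constant_cols)            # ['flag__SF']
print(list(df_clean.columns))   # ['duration', 'type']
```

This generalizes the manual list and would also catch any column that becomes constant in a future data refresh.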
Since sklearn (and machine learning in general) only understands numbers, we convert the type column (currently strings, "normal" and "attack") to corresponding numbers (for example: 0 for normal and 1 for attack). Colloquially, it is often said that scikit-learn does not like strings.
pasar_a_numeros = []
for row in conexiones_limpio_dataset["type"]:
    if row == "normal":
        pasar_a_numeros.append(0)
    elif row == "attack":
        pasar_a_numeros.append(1)
    else:
        print("Error")  # Better safe than sorry
conexiones_limpio_dataset["type"] = pasar_a_numeros
conexiones_limpio_dataset
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 274 | 275 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.01 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 7 | 17 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.24 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 29 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.10 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 12 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.08 | 0.67 | 0.00 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19529 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | 0 |
| 19530 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | 1 |
| 19531 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 188 | 187 | 0.0 | 0.0 | 0.0 | 0.0 | 0.99 | 0.01 | 0.00 | 0 |
| 19532 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | 0 |
| 19533 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 43 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.05 | 0 |
19534 rows × 39 columns
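The loop above works; an equivalent and more idiomatic pandas route is `Series.map`, which also surfaces unknown labels as NaN instead of only printing a warning. A sketch:

```python
import pandas as pd

s = pd.Series(["normal", "attack", "normal"])
encoded = s.map({"normal": 0, "attack": 1})
print(encoded.tolist())  # [0, 1, 0]

# Any label outside the mapping becomes NaN, which is easy to detect:
assert not encoded.isna().any()
```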
We therefore now have a purely numerical DataFrame, and we proceed to split it into train and test:
from sklearn.model_selection import train_test_split
train_y_val, test = train_test_split(conexiones_limpio_dataset,
train_size = 0.8,
test_size = 0.2,
random_state = 101)
An 80/20 split was considered reasonable: 80% of the rows for training and validating the model, and 20% for the final test.
As good practice, a fixed pseudo-random seed is used for data selection from this point on. This is especially relevant for making fair comparisons between the models to be tried when choosing the winning model: random_state = 101.
train_y_val
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6353 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 10 | 16 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.12 | 0 |
| 13618 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 10 | 1 | 0.1 | 0.0 | 0.0 | 0.0 | 0.10 | 0.20 | 0.00 | 0 |
| 5081 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 5 | 17 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.24 | 0 |
| 14431 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 68 | 146 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.01 | 0 |
| 18226 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 11 | 12 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.17 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5695 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 18 | 20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.15 | 0 |
| 8006 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 107 | 2 | 1.0 | 1.0 | 0.0 | 0.0 | 0.02 | 0.07 | 0.00 | 0 |
| 17745 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 33 | 33 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | 0 |
| 17931 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 1.00 | 0 |
| 13151 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.00 | 0 |
15627 rows × 39 columns
test
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10087 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 1.0 | 1.0 | 1.00 | 0.00 | 0.0 | 0 |
| 4091 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 3 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.67 | 0.67 | 0.0 | 0 |
| 5266 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 1.0 | 0 |
| 10339 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8274 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 6 | 6 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | 0 |
| 11174 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 9 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | 0 |
| 10151 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | 0 |
| 17918 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 184 | 184 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | 0 |
| 7458 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.00 | 0.0 | 0 |
3907 rows × 39 columns
How well do the training/validation set and the test set preserve the proportion of good and malicious connections? Recall that the overall proportions in the dataset were normal: 93.590662 % and attack: 6.409338 %.
100 * train_y_val["type"].value_counts() / len(train_y_val["type"])
0    93.568823
1     6.431177
Name: type, dtype: float64
100 * test["type"].value_counts() / len(test["type"])
0    93.678014
1     6.321986
Name: type, dtype: float64
Both classes are represented in reasonably good proportions. Other seeds (2001, 753, 777) were tried without obtaining values that looked better a priori. The literature consulted does not encourage changing the seed either, unless the resulting proportions deviated strongly from the original ones, compromising adequate representativeness (not the case here).
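Rather than trying seeds by hand, train_test_split accepts a stratify argument that preserves the class proportions by construction. A self-contained sketch with synthetic labels (the 94/6 imbalance here is illustrative, roughly mirroring the dataset's):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 94% class 0, 6% class 1.
y = pd.Series([0] * 94 + [1] * 6)
X = pd.DataFrame({"x": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y)

# Each split keeps (almost exactly) the original minority share.
print(y_tr.mean(), y_te.mean())
```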
Before building models or running cross-validations, a number of decisions about the data must be made:
1.- Discard variables outright because of high correlation? The working premise is not to remove variables due to high correlation.
2.- Remove outliers? The working premise is not to remove outliers.
3.- Standardize the features? Yes, this will be done.
4.- Select features / apply dimensionality reduction? Feature selection will be performed, but with only 38 remaining features a dimensionality reduction does not seem necessary.
At this point, the objective from now on is to try a large number of combinations of models and hyperparameters.
from sklearn.pipeline import Pipeline
# Decision tree: requires neither prior feature selection nor a Pipeline as such.
from sklearn.tree import DecisionTreeClassifier
arbol = DecisionTreeClassifier()
# Random Forest: needs no prior feature selection and no Pipeline as such.
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()
# Gradient Boosting Trees: needs no prior feature selection and no Pipeline as such.
from sklearn.ensemble import GradientBoostingClassifier
gradient_boosting = GradientBoostingClassifier()
# From here on, models that do require prior feature selection and a proper Pipeline.
# Logistic regression: two Pipelines, one with an RFECV selector and one with SelectKBest.
# First step of both Pipelines: StandardScaler.
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
logreg_rfecv = Pipeline(steps=[("scaler",StandardScaler()),
("rfecv",RFECV(estimator=LogisticRegression())),
("logreg",LogisticRegression())
]
)
logreg_kbest = Pipeline(steps=[("scaler",StandardScaler()),
("kbest",SelectKBest()),
("logreg",LogisticRegression())
]
)
# Nearest Neighbors. Two Pipelines, one without selection and one with KBest.
from sklearn.neighbors import KNeighborsClassifier
neighbors = Pipeline(steps=[("scaler",StandardScaler()),
("knn",KNeighborsClassifier())
]
)
neighbors_kbest = Pipeline(steps=[("scaler",StandardScaler()),
("kbest",SelectKBest()),
("knn",KNeighborsClassifier())
]
)
# SVMs, without selection and with SelectKBest.
from sklearn.svm import SVC
svm = Pipeline(steps=[("scaler",StandardScaler()),
("svm",SVC())
]
)
svm_kbest = Pipeline(steps=[("scaler",StandardScaler()),
("kbest",SelectKBest()),
("svm",SVC())])
# Naïve Bayes
from sklearn.naive_bayes import GaussianNB
nb = Pipeline(steps=[("nb",GaussianNB())])
# Multilayer perceptron, without selection.
from sklearn.neural_network import MLPClassifier
mlp = Pipeline(steps=[("scaler",StandardScaler()),
("mlp",MLPClassifier())
]
)
For each of the Pipelines, the hyperparameters to try are now selected.
# Decision tree.
grid_arbol = {"max_depth": list(range(1,11))  # Depths from 1 to 10
              }
# Random Forest.
grid_random_forest = {"n_estimators": [425],  # Number of trees
                      "max_depth": [50],  # Different depths were tried
                      "max_features": ["sqrt"],  # Features considered at each split of each tree.
                      "criterion": ["entropy"],
                      "min_samples_split": [7],
                      "class_weight": ["balanced"]
                      }
# Gradient Boosting Trees.
grid_gradient_boosting = {"n_estimators": [815],
                          "subsample": [0.1],  # Fraction of the training samples used to fit each tree.
                          "max_features": ["auto"],
                          "min_samples_leaf": [75]
                          }
# Logistic regression.
grid_logreg_rfecv = {"rfecv__step": [1],  # Remove features one at a time; the most conservative option.
                     "rfecv__cv": [5],  # Number of folds in RFECV's internal CV.
                     "logreg__penalty": ["l1","l2"],  # L1 and L2 regularization.
                     "logreg__C": [0.1, 0.5, 1.0, 5.0],  # Inverse of the regularization strength.
                     "logreg__fit_intercept": [True],
                     "logreg__max_iter": [50,100,500],
                     "logreg__solver": ["liblinear"]  # Generally the fastest.
                     }
grid_logreg_kbest = {"kbest__score_func": [f_classif],  # ANOVA F-test to pick the K best features.
                     "kbest__k": [15, 25, 35, 45],  # Number of features to keep (the most informative).
                     "logreg__penalty": ["l1","l2"],
                     "logreg__C": [0.1, 0.5, 1.0, 5.0],
                     "logreg__fit_intercept": [True],
                     "logreg__max_iter": [50,100,500],
                     "logreg__solver": ["liblinear"]
                     }
# Nearest Neighbors.
grid_neighbors = {"knn__n_neighbors": [3,5,7,9,11],  # Odd values are preferred (avoids ties)
                  "knn__weights": ["uniform","distance"]  # Whether to weight each neighbor's vote
                                                          # by the inverse of its distance
                  }
grid_neigbors_kbest = {"kbest__score_func": [f_classif],
                       "kbest__k": [1,2,3],
                       "knn__n_neighbors": [3,5,7,9,11],
                       "knn__weights": ["uniform","distance"]
                       }
# SVMs:
grid_svm = {"svm__C": [1, 100],
"svm__kernel": ["rbf"],
"svm__gamma": [5]
}
grid_svm_kbest = {"kbest__score_func": [f_classif],
"kbest__k": [1,2,3],
"svm__C": [1, 100],
"svm__kernel": ["rbf"],
"svm__gamma": [5]
}
# Naïve Bayes.
grid_nb = {"nb__var_smoothing": [1e-09, 2e-9]
}
# MLP.
grid_mlp = {"mlp__hidden_layer_sizes": [(100,100,100)],
"mlp__activation": ["relu"],
"mlp__solver": ["adam"],
"mlp__alpha": [0.00001, 0.0001],
"mlp__validation_fraction": [0.1],
"mlp__early_stopping": [True],
"mlp__max_iter": [6000],
"mlp__learning_rate_init": [0.00001, 0.001]
}
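Before launching the searches, it can be useful to know how many candidates each grid generates (the 10-fold CV then multiplies each by ten fits). sklearn's `ParameterGrid` does the counting; shown here for two of the grids above:

```python
from sklearn.model_selection import ParameterGrid

# Two of the grids defined above, reproduced for a self-contained example.
grid_arbol = {"max_depth": list(range(1, 11))}
grid_svm = {"svm__C": [1, 100], "svm__kernel": ["rbf"], "svm__gamma": [5]}

# The number of candidates is the product of the lengths of the value lists.
for name, grid in {"arbol": grid_arbol, "svm": grid_svm}.items():
    print(name, len(ParameterGrid(grid)))  # 10 and 2 candidates respectively
```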
All of these possible combinations are now assembled.
Each GridSearchCV will be built with its corresponding Pipeline and hyperparameters. Given the class imbalance, accuracy will not be used; instead, F1-score or the ROC curve. The area under the ROC curve, AUC, is a very good metric: it ranges from 0.5 (no better than chance) to 1 (perfect) and is very visual. The F1-score is also widely used: the closer to 1, the better.
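On toy predictions, both metrics behave as described; note that `roc_auc_score` expects scores or probabilities, while `f1_score` expects hard labels (the numbers below are made up for illustration):

```python
from sklearn.metrics import roc_auc_score, f1_score

y_true = [0, 0, 0, 0, 1, 1]               # imbalanced toy labels
y_score = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]  # model scores for the positive class
y_pred = [0, 0, 0, 1, 1, 1]               # hard labels after thresholding at 0.5

print(roc_auc_score(y_true, y_score))  # ranking quality: 0.5 = random, 1 = perfect
print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall
```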
To compare all models on an equal footing, it is important to fix an identical cross-validation for all of them, with the same random seed.
from sklearn.model_selection import KFold
# Cross-validation: 10 folds are considered adequate.
config_cross_validation = KFold(10, shuffle = True, random_state = 320)
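With a ~6 % positive class, `StratifiedKFold` is a common alternative to plain `KFold`, since it keeps the class ratio inside every fold. A hedged sketch on toy labels (not what was run in this notebook):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 94 + [1] * 6)  # toy labels mimicking the ~94/6 imbalance
X = np.zeros((100, 1))            # dummy features; only y drives the split

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=320)
for _, val_idx in skf.split(X, y):
    # Each validation fold keeps ~6 % positives, so roc_auc is always defined.
    print(y[val_idx].mean())
```

With plain `KFold` and few positives, a fold could in principle end up with a single class, which would make fold-level ROC AUC undefined.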
from sklearn.model_selection import GridSearchCV
# Tasks are parallelized to run faster: n_jobs=-1 uses all available cores.
gs_arbol = GridSearchCV(arbol,
grid_arbol,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_random_forest = GridSearchCV(random_forest,
grid_random_forest,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_gradient_boosting = GridSearchCV(gradient_boosting,
grid_gradient_boosting,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_logreg_rfecv = GridSearchCV(logreg_rfecv,
grid_logreg_rfecv,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_logreg_kbest = GridSearchCV(logreg_kbest,
grid_logreg_kbest,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_neighbors = GridSearchCV(neighbors,
grid_neighbors,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_neighbors_kbest = GridSearchCV(neighbors_kbest,
grid_neigbors_kbest,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_svm = GridSearchCV(svm,
grid_svm,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_svm_kbest = GridSearchCV(svm_kbest,
grid_svm_kbest,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_nb = GridSearchCV(nb,
grid_nb,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
gs_mlp = GridSearchCV(mlp,
grid_mlp,
cv=config_cross_validation,
scoring="roc_auc",
verbose=1,
n_jobs=-1)
All the GridSearchCV objects are placed in a dictionary so that the results keep key-value pairs, i.e., description : GridSearchCV.
todos_los_grid_searchs = {"gs_arbol":gs_arbol,
"gs_random_forest":gs_random_forest,
"gs_gradient_boosting":gs_gradient_boosting,
"gs_logreg_rfecv":gs_logreg_rfecv,
"gs_logreg_kbest":gs_logreg_kbest,
"gs_neighbors":gs_neighbors,
"gs_neighbors_kbest":gs_neighbors_kbest,
"gs_svm": gs_svm,
"gs_svm_kbest":gs_svm_kbest,
"gs_nb":gs_nb,
"gs_mlp":gs_mlp}
Iterate over each key-value pair of the todos_los_grid_searchs dictionary and, for each pair, launch its GridSearchCV.
for descripcion, grid_search in todos_los_grid_searchs.items():
    print("Running Grid Search for %s..." % descripcion)
    grid_search.fit(train_y_val[features], train_y_val["type"])
Running Grid Search for gs_arbol... Fitting 10 folds for each of 10 candidates, totalling 100 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 3.2s [Parallel(n_jobs=-1)]: Done 85 out of 100 | elapsed: 3.9s remaining: 0.6s [Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 4.0s finished
Running Grid Search for gs_gradient_boosting... Fitting 10 folds for each of 1 candidates, totalling 10 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 6 out of 10 | elapsed: 7.4s remaining: 4.9s [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 11.1s finished
Running Grid Search for gs_logreg_rfecv... Fitting 10 folds for each of 24 candidates, totalling 240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 4.1min
[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 19.7min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 25.5min finished
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
This ConvergenceWarning was repeated throughout the run. It most likely comes from the LogisticRegression used internally by RFECV, which keeps its defaults (lbfgs solver, max_iter=100), since the grid only tunes the final logreg step; raising rfecv__estimator__max_iter would probably silence it.
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
[the same ConvergenceWarning is repeated for each remaining cross-validation fit]
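The warning above means the lbfgs solver hit its iteration limit before converging. A minimal sketch of the two fixes the warning itself suggests, scaling the features inside a Pipeline and raising `max_iter` (shown here on synthetic data, not the KDD dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, just to illustrate the pattern.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaling usually lets lbfgs converge in far fewer iterations;
# a larger max_iter is the fallback when it still hits the limit.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=1000))
pipe.fit(X, y)
print(pipe.score(X, y))
```

Wrapping the scaler and the classifier in one Pipeline also keeps the scaling inside each cross-validation fold, avoiding leakage from the validation part of the fold.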
Running Grid Search for gs_logreg_kbest... Fitting 10 folds for each of 96 candidates, totalling 960 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 2.2s [Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 12.3s [Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 38.2s [Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 1.7min [Parallel(n_jobs=-1)]: Done 960 out of 960 | elapsed: 1.8min finished
Running Grid Search for gs_neighbors... Fitting 10 folds for each of 10 candidates, totalling 100 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 21.1s [Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 53.9s finished
Running Grid Search for gs_neighbors_kbest... Fitting 10 folds for each of 30 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 4.1s [Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 19.9s [Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 34.3s finished
Running Grid Search for gs_svm... Fitting 10 folds for each of 2 candidates, totalling 20 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 20 out of 20 | elapsed: 1.3min finished
Running Grid Search for gs_svm_kbest... Fitting 10 folds for each of 6 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 31.9s [Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 1.0min finished
Running Grid Search for gs_nb... Fitting 10 folds for each of 2 candidates, totalling 20 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers. [Parallel(n_jobs=-1)]: Done 20 out of 20 | elapsed: 0.1s finished [Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
Running Grid Search for gs_mlp... Fitting 10 folds for each of 4 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Done 40 out of 40 | elapsed: 1.0min finished
We now look at the best cross-validation score obtained by each GridSearchCV.
mejor_score_de_cada_GridSearchCV = [(nombre_modelo, grid_search.best_score_)
for nombre_modelo, grid_search
in todos_los_grid_searchs.items()]
mejor_score_de_cada_GridSearchCV
[('gs_arbol', 0.8257659530395809),
('gs_random_forest', 0.852297014253249),
('gs_gradient_boosting', 0.8386706170244095),
('gs_logreg_rfecv', 0.8288336008851586),
('gs_logreg_kbest', 0.833735152532108),
('gs_neighbors', 0.8332684179330515),
('gs_neighbors_kbest', 0.750097372995737),
('gs_svm', 0.8165738451866613),
('gs_svm_kbest', 0.7296318524357172),
('gs_nb', 0.746884702650629),
('gs_mlp', 0.8337427476472301)]
For easier reading, the results are put into a DataFrame sorted from best to worst score.
mejor_score_de_cada_GridSearchCV_df = pd.DataFrame(mejor_score_de_cada_GridSearchCV,
columns=["GridSearchCV", "Mejor score"])
mejor_score_de_cada_GridSearchCV_df_ordenado = (mejor_score_de_cada_GridSearchCV_df
.sort_values(by="Mejor score", ascending=False)
)
mejor_score_de_cada_GridSearchCV_df_ordenado
| | GridSearchCV | Mejor score |
|---|---|---|
| 1 | gs_random_forest | 0.852297 |
| 2 | gs_gradient_boosting | 0.838671 |
| 10 | gs_mlp | 0.833743 |
| 4 | gs_logreg_kbest | 0.833735 |
| 5 | gs_neighbors | 0.833268 |
| 3 | gs_logreg_rfecv | 0.828834 |
| 0 | gs_arbol | 0.825766 |
| 7 | gs_svm | 0.816574 |
| 6 | gs_neighbors_kbest | 0.750097 |
| 9 | gs_nb | 0.746885 |
| 8 | gs_svm_kbest | 0.729632 |
The winning model, i.e. the GridSearchCV with the best cross-validated roc_auc, is the Random Forest classifier, with a best score of 85.23%.
mejor_GridSearchCV = todos_los_grid_searchs["gs_random_forest"]
Within this GridSearchCV, the best estimator and its hyperparameters are the following.
mejor_Modelo = mejor_GridSearchCV.best_estimator_
mejor_GridSearchCV.best_params_
{'class_weight': 'balanced',
'criterion': 'entropy',
'max_depth': 50,
'max_features': 'sqrt',
'min_samples_split': 7,
'n_estimators': 425}
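As a side note, `best_params_` is a plain dict, so an equivalent (un-fitted) estimator can be rebuilt from it alone, which is handy when the GridSearchCV object itself is not kept around. A minimal sketch using the values printed above:

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters found by the grid search (copied from best_params_ above).
best_params = {'class_weight': 'balanced',
               'criterion': 'entropy',
               'max_depth': 50,
               'max_features': 'sqrt',
               'min_samples_split': 7,
               'n_estimators': 425}

# Dict unpacking rebuilds an equivalent, un-fitted estimator.
modelo_equivalente = RandomForestClassifier(**best_params)
print(modelo_equivalente.get_params()['n_estimators'])  # → 425
```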
The model is retrained on the full training set (train + validation).
mejor_Modelo.fit(train_y_val[features], train_y_val["type"])
RandomForestClassifier(class_weight='balanced', criterion='entropy',
max_depth=50, max_features='sqrt', min_samples_split=7,
n_estimators=425)
Once the winning model has been selected and retrained on the whole training set, we look at how it predicts on the test set. This tells us what results we can expect from the model on future data.
We compute the roc_auc on the test set, together with the confusion matrix. (Note that roc_auc is not the percentage of observations the model classifies correctly; that would be accuracy.)
8.1.- ROC AUC
from sklearn.metrics import roc_auc_score
roc_auc_en_test = roc_auc_score(y_true = test["type"],
y_score = mejor_Modelo.predict(test[features])
)
print("The model's roc_auc on the test set is %s" % roc_auc_en_test)
The model's roc_auc on the test set is 0.8330390920554854
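Note that `roc_auc_score` is being fed the hard 0/1 output of `predict`; the metric is designed to rank scores, so passing the predicted probability of the positive class (via `predict_proba`) usually gives a more informative AUC. A sketch of the difference on synthetic data (a `make_classification` stand-in, not the KDD data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, loosely mimicking the attack/normal imbalance.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUC from hard labels (what the notebook computes) vs. from probabilities.
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
auc_probas = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_labels, auc_probas)
```

With hard labels the AUC collapses to a function of a single operating point; the probability version uses the full ranking.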
8.2.- Confusion matrix
from sklearn.metrics import confusion_matrix
matriz_confusion = confusion_matrix(y_true = test["type"],
y_pred = mejor_Modelo.predict(test[features])
)
matriz_confusion
array([[3564, 96],
[ 76, 171]], dtype=int64)
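The per-class figures quoted later (97% of good connections caught, 69% of malicious ones) can be recovered directly from this matrix. A quick check with the numbers above:

```python
import numpy as np

# Confusion matrix from the test set: rows = real class, columns = predicted.
cm = np.array([[3564, 96],
               [76, 171]])
tn, fp, fn, tp = cm.ravel()

recall_attacks = tp / (tp + fn)      # fraction of malicious connections caught
recall_normal = tn / (tn + fp)       # fraction of good connections kept as good
precision_attacks = tp / (tp + fp)   # fraction of flagged connections that are real attacks

print(round(recall_attacks, 2), round(recall_normal, 2), round(precision_attacks, 2))
# → 0.69 0.97 0.64
```

These match the 0.69 recall and 0.64 precision for class 1 in the classification report below.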
A DataFrame makes the confusion matrix visually clearer.
matriz_confusion_df = pd.DataFrame(matriz_confusion)
matriz_confusion_df.columns = ["Good", "Malicious"]
matriz_confusion_df.index = ["Good", "Malicious"]
matriz_confusion_df.columns.name = "Predicted"
matriz_confusion_df.index.name = "Real"
matriz_confusion_df
| Predicted | Good | Malicious |
|---|---|---|
| Real | | |
| Good | 3564 | 96 |
| Malicious | 76 | 171 |
A heatmap of the confusion matrix is plotted.
plt.figure(figsize=(16,12))
sns.heatmap(matriz_confusion_df,
            annot=True,
            cmap="Blues",
            fmt='g',
            annot_kws={'size': 14})
<AxesSubplot:xlabel='Predicted', ylabel='Real'>
from sklearn.metrics import classification_report
print(classification_report(test["type"],
mejor_Modelo.predict(test[features])))
precision recall f1-score support
0 0.98 0.97 0.98 3660
1 0.64 0.69 0.67 247
accuracy 0.96 3907
macro avg 0.81 0.83 0.82 3907
weighted avg 0.96 0.96 0.96 3907
Going forward, the winning model can be expected to classify around 97% of the good connections as good and around 69% of the malicious ones as malicious. That is what we can expect from the model on future data.
9.1.- Saving the model
import pickle
with open("mejor_Modelo_intrusiones_marcos_paricio", "wb") as archivo_salida:
pickle.dump(mejor_Modelo, archivo_salida)
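pickle works, but for scikit-learn estimators that hold large NumPy arrays internally (such as a forest of 425 trees), joblib is the serializer the scikit-learn docs recommend, as it handles those arrays more efficiently. A sketch of the same save/load round trip with joblib (on a small hypothetical model, since `mejor_Modelo` is not rebuilt here):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small hypothetical model standing in for mejor_Modelo.
X, y = make_classification(random_state=0)
modelo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Dump to a temporary path and load it back.
ruta = os.path.join(tempfile.mkdtemp(), "modelo.joblib")
joblib.dump(modelo, ruta)
modelo_cargado = joblib.load(ruta)

# The reloaded model produces identical predictions.
print((modelo_cargado.predict(X) == modelo.predict(X)).all())  # → True
```

As with pickle, only load files you trust: both formats can execute arbitrary code on load.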
9.2.- Loading the model
with open("mejor_Modelo_intrusiones_marcos_paricio", "rb") as archivo_entrada:
modelo_importado = pickle.load(archivo_entrada)
modelo_importado
RandomForestClassifier(class_weight='balanced', criterion='entropy',
max_depth=50, max_features='sqrt', min_samples_split=7,
n_estimators=425)
9.3.- Check that the reloaded model yields the same test result.
roc_auc_accuracy_en_test_pipeline_cargada = roc_auc_score(y_true = test["type"],
y_score = modelo_importado.predict(test[features])
)
print("roc_auc of the model after saving and reloading it: %s" % roc_auc_accuracy_en_test_pipeline_cargada)
roc_auc of the model after saving and reloading it: 0.8330390920554854
n_c_dataset = pd.read_csv("nuevas_conexiones.csv")
n_c_dataset
| diff_srv_rate | dst_bytes | duration | flag__OTH | flag__REJ | flag__RSTO | flag__RSTR | flag__S0 | flag__SF | hot | ... | service__telnet | src_bytes | srv_count | srv_diff_host_rate | srv_rerror_rate | srv_serror_rate | su_attempted | urgent | wrong_fragment | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 9007.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 325.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 11.0 | 29.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.01 | 57.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 187.0 | 240.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 241.0 |
3 rows × 45 columns
n_c_limpio_dataset = n_c_dataset.drop(["flag__SF", "flag__S0", "flag__REJ", "flag__RSTR", "flag__RSTO", "num_outbound_cmds", "is_host_login"], axis=1)
columns_names_features = n_c_limpio_dataset.columns.values
features_n_c_limpio_dataset = list(columns_names_features)
features_n_c_limpio_dataset
['diff_srv_rate', 'dst_bytes', 'duration', 'flag__OTH', 'hot', 'is_guest_login', 'land', 'logged_in', 'num_access_files', 'num_compromised', 'num_failed_logins', 'num_file_creations', 'num_root', 'num_shells', 'protocol_type__icmp', 'protocol_type__tcp', 'protocol_type__udp', 'rerror_rate', 'root_shell', 'same_srv_rate', 'serror_rate', 'service__domain_u', 'service__ftp', 'service__ftp_data', 'service__http', 'service__other', 'service__private', 'service__smtp', 'service__telnet', 'src_bytes', 'srv_count', 'srv_diff_host_rate', 'srv_rerror_rate', 'srv_serror_rate', 'su_attempted', 'urgent', 'wrong_fragment', 'count']
The variables are not in the same order: the order of features_n_c_limpio_dataset differs from that of features.
features
['duration', 'protocol_type__tcp', 'protocol_type__udp', 'protocol_type__icmp', 'service__http', 'service__private', 'service__domain_u', 'service__smtp', 'service__ftp_data', 'service__telnet', 'service__ftp', 'service__other', 'src_bytes', 'dst_bytes', 'flag__OTH', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate']
This is fixed so that the predictions are made on columns in the exact order the model expects.
# The list `features` already holds the column names in training order,
# so the reordering reduces to a single selection.
n_c_limpio_ordenado_dataset = n_c_limpio_dataset[features]
Check the result.
columns_names_features = n_c_limpio_ordenado_dataset.columns.values
features_n_c_limpio_ordenado_dataset = list(columns_names_features)
features_n_c_limpio_ordenado_dataset
['duration', 'protocol_type__tcp', 'protocol_type__udp', 'protocol_type__icmp', 'service__http', 'service__private', 'service__domain_u', 'service__smtp', 'service__ftp_data', 'service__telnet', 'service__ftp', 'service__other', 'src_bytes', 'dst_bytes', 'flag__OTH', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate']
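A slightly more defensive alternative is `reindex(columns=features)`: it performs the same reordering, but if a training feature is missing from the new data it inserts a NaN column instead of raising, which makes the mismatch explicit. A sketch with hypothetical mini-frames (the three column names are illustrative only):

```python
import pandas as pd

features = ["duration", "src_bytes", "dst_bytes"]   # training-time order
nuevas = pd.DataFrame({"dst_bytes": [9007.0],
                       "duration": [0.0],
                       "src_bytes": [325.0]})       # arrives in arbitrary order

# Reorder the new data to the exact order the model was trained with.
alineado = nuevas.reindex(columns=features)
print(list(alineado.columns))  # → ['duration', 'src_bytes', 'dst_bytes']
```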
predicciones = modelo_importado.predict(n_c_limpio_ordenado_dataset)
predicciones
array([0, 1, 0], dtype=int64)
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column.
predicciones_nuevas_conexiones = n_c_limpio_ordenado_dataset.copy()
predicciones_nuevas_conexiones["prediccion_ml"] = predicciones
predicciones_nuevas_conexiones
| duration | protocol_type__tcp | protocol_type__udp | protocol_type__icmp | service__http | service__private | service__domain_u | service__smtp | service__ftp_data | service__telnet | ... | count | srv_count | serror_rate | srv_serror_rate | rerror_rate | srv_rerror_rate | same_srv_rate | diff_srv_rate | srv_diff_host_rate | prediccion_ml | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.00 | 1.0 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.00 | 1.0 | 1 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 241.0 | 240.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.01 | 0.0 | 0 |
3 rows × 39 columns
predicciones_nuevas_conexiones.to_csv('predicciones_nuevas_conexiones.csv')
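To make the exported CSV self-explanatory, the 0/1 predictions can also be mapped back to the original `normal`/`attack` labels. A minimal sketch (the column name `prediccion_ml` is taken from above; the encoding 0 = normal, 1 = attack is assumed from the notebook's target encoding):

```python
import pandas as pd

# Stand-in for predicciones_nuevas_conexiones["prediccion_ml"].
predicciones = pd.DataFrame({"prediccion_ml": [0, 1, 0]})

# Map the numeric predictions back to human-readable labels.
predicciones["etiqueta"] = predicciones["prediccion_ml"].map({0: "normal", 1: "attack"})
print(predicciones["etiqueta"].tolist())  # → ['normal', 'attack', 'normal']
```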
The End